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1  Introduction 


The  cost  of  an  algorithm  is  a  function  of  both  computation ,  the  number  of  arithmetic  op¬ 
erations  performed,  and  communication ,  the  amount  of  data  movement.  Communication 
cost  encapsulates  both  data  movement  between  levels  of  the  memory  hierarchy  and  between 
processors,  and  the  number  of  messages  in  which  the  data  is  sent.  In  terms  of  performance, 
communication  costs  are  much  greater  than  computation  costs  on  modern  computer  archi¬ 
tectures,  and  the  gap  is  only  expected  to  widen  in  future  systems.  Therefore,  in  order  to 
increase  the  performance  of  an  algorithm,  we  must  turn  to  strategies  to  minimize  commu¬ 
nication  rather  than  try  to  decrease  the  number  of  arithmetic  operations.  We  call  this  a 
“communication-avoiding”  (CA)  approach  to  algorithmic  design. 

Many  scientific  applications  require  codes  which  solve  linear  systems.  Methods  for  solving 
linear  systems  can  be  classified  either  as  direct  methods,  which  perform  a  fixed  number  of 
steps  to  find  a  solution,  or  iterative  methods,  which  repeatedly  refine  a  candidate  solution 
until  acceptable  convergence  is  reached.  Such  iterative  methods  are  commonly  used  when 
the  matrix  is  too  large  to  be  solved  with  direct  methods,  when  the  matrix  is  sparse,  or 
when  an  error  tolerance  greater  than  machine  precision  is  acceptable.  The  most  general  and 
flexible  class  of  iterative  methods  are  Krylov  Subspace  Methods  (KSMs).  These  methods  are 
based  on  projection  onto  expanding  subspaces,  where,  in  each  iteration  s,  the  “best”  solution 
is  chosen  from  the  expanding  Krylov  subspace,  tC(s ,  A,  v )  =  spanju,  Av,  A2v, ...,  As~lv}. 
Numerous  variants  of  Krylov  subspace  methods  exist,  each  with  different  properties,  different 
storage  requirements,  and  different  methods  of  selecting  the  “best”  new  solution. 

Standard  implementations  of  KSMs  require  one  or  more  Sparse  Matrix-Vector  Multi¬ 
plications  (SpMVs)  and  one  or  more  vector  operations  in  each  iteration.  These  are  both 
communication-bound  operations.  To  perform  an  SpMV,  each  processor  must  communicate 
entries  of  the  source  vector  to  other  processors  in  the  parallel  algorithm,  and  A  must  be  read 
from  slow  memory  in  the  sequential  algorithm.  Vector  operations,  such  as  dot  products, 
involve  a  global  reduction  in  the  parallel  algorithm,  and  a  number  of  reads  and  writes  to 
slow  memory  in  the  sequential  algorithm  (depending  on  the  size  of  the  vectors  and  the  size  of 
the  fast  memory).  The  goal  of  Communication- Avoiding  KSMs  (CA-KSMs)  is  to  break  the 
dependency  between  SpMV  and  vector  operations  in  each  iteration,  enabling  us  to  perform 
s  iterations  for  the  communication  cost  of  1  +  0(1)  iterations. 

Hoemmen,  et  al.  have  previously  derived  and  tested  algorithms  for  communication- 
avoiding  implementations  of  both  the  Generalized  Minimal  Residual  Method  (CA-GAIRES) 
and  the  Conjugate  Gradient  Method  (CA-CG)  [20,  27].  Communication-avoiding  direct 
methods  have  also  been  implemented  and  analyzed  -  for  a  good  overview,  see  [3].  In  our  work, 
we  focus  specifically  on  two-sided  KSMs,  which,  implicitly  or  explicitly,  use  two  different 
Krylov  subspaces  -  one  for  the  “right-hand”  space,  spa,n{u,  Av,  A2v,...},  and  one  for  the 
“left-hand”  space,  span{w,  AHw,  ( AH)2w ,  ...},  where  AH  denotes  the  conjugate  transpose 
of  A.  Two-sided  KSMs  are  specifically  suited  for  nonsymmetric  linear  systems,  which  arise 
frequently  in  many  scientific  domains. 

The  approach  to  avoiding  communication  in  KSMs  involves  two  communication-avoiding 
kernels  -  the  matrix  powers  kernel  (Akx),  and  the  Tall-Skinny  QR  kernel  (TSQR).  By  using 
communication-avoiding  kernels,  we  are  able  to  reduce  parallel  latency,  and  sequential  band¬ 
width  and  latency  costs  in  a  KSM  by  a  factor  of  0(s).  If  we  are  communication-bound,  this 
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suggests  up  to  a  factor  of  s  speedup,  which  is  desirable  even  for  small  s. 

Given  a  matrix  A ,  a  vector  v,  and  a  basis  length  s,  the  matrix  powers  kernel  computes  the 
basis  vectors  \pq(A)v,  p±(A)v,  p2(A)vi  ...,  ps-i(A)v\,  where  pj  is  a  polynomial  of  degree  j.  If 
A  is  well-partitioned  (i.e.,  A  has  a  low  surface-to-volume  ratio,  see  [12]),  the  s— dimensional 
basis  can  be  computed  only  reading  A  (or  communicating  entries  of  v  in  the  parallel  case) 
1  +  0(1)  times.  Therefore,  to  perform  s  iterations  of  the  KSM,  we  only  need  to  communicate 
1  +  0(1)  times  as  opposed  to  s  times  (s  SpMV  operations)  in  the  standard  implementation. 
This  idea  has  been  discussed  in  the  literature  for  the  stencil  case,  dating  back  to  Hong  and 
Ivung’s  red-blue  pebble  game  [21].  Our  implementation  uses  the  same  approach,  but  also 
extends  to  general  graphs,  with  the  only  requirement  that  they  remain  well-partitioned. 

The  TSQR  kernel  is  useful  for  avoiding  communication  in  KSMs  which  require  explicit 
orthogonalization  (e.g.,  GMRES).  The  TSQR  operation  is  performed  on  the  output  of  the 
matrix  powers  kernel,  which  is  a  tall-skinny  matrix  ( 0(n )  x  0(s)),  to  orthogonalize  the 
basis.  This  replaces  the  Modified  Gram-Schmidt  operation  in  standard  GMRES,  reducing 
communication  by  a  factor  of  s  (see  [12,  20,  27]  for  details). 

In  addition  to  the  matrix  powers  kernel  and  TSQR,  our  communication-avoiding  imple¬ 
mentations  also  require  an  additional  kernel  to  compute  the  Gram-like  matrix,  VTV,  where 
V  and  V  are  O(n)  x  0(s)  matrices  of  Krylov  basis  vectors  (the  output  of  the  matrix  powers 
kernel).  We  do  not  discuss  implementation  of  this  kernel  further,  as  this  operation  can  be 
performed  in  a  straightforward  block-wise  fashion  so  as  to  minimize  communication. 

Our  primary  contributions  are  as  follows: 

•  We  have  derived  three  new  two-sided  Communication-Avoiding  Krylov  Subspace  Meth¬ 
ods  (CA-IvSMs):  Biconjugate  Gradient  (CA-BICG),  Conjugate  Gradient  Squared  (CA- 
CGS),  and  Biconjugate  Gradient  Stabilized  (CA-BICGSTAB).  These  are  mathemat¬ 
ically,  but  not  numerically,  equivalent  to  the  standard  implementations,  in  the  sense 
that  after  every  s  steps,  they  produce  an  identical  solution  as  the  conventional  al¬ 
gorithm  in  exact  arithmetic.  We  give  algorithms  for  two-term  recurrence  versions  of 
each  method.  We  also  comment  on  3-term  variants  for  these  methods,  which  are  more 
susceptible  to  round  off  error,  but  may  be  attractive  if  computation  costs  are  too  high 
or  storage  is  limited.  We  give  an  example  derivation  for  BIGG  (CA-BICG3). 

•  Our  CA-KSMs  handle  complex  inputs,  and  preconditioning  in  the  s-step  basis.  In 
order  to  remain  communication-avoiding,  the  preconditioned  matrix  M^1AA/IIi1  must 
be  well-partitioned.  Since  preconditioning  serves  to  reduce  the  condition  number  of  A, 
we  only  expect  this  to  improve  the  quality  of  our  s-step  bases. 

•  We  provide  convergence  results  for  our  methods  on  a  set  of  small  test  matrices.  The 
monomial  basis  quickly  becomes  ill-conditioned  for  larger  basis  sizes,  which  can  slow 
down  or  prevent  convergence.  We  derive  the  necessary  recurrences  and  change-of-basis 
matrices  for  both  the  Newton  basis  and  the  scaled  and  shifted  Chebyshev  basis.  Both 
of  these  recurrences  are  known  to  produce  bases  which  can  be  much  better  conditioned 
than  the  monomial  basis,  thus  preserving  convergence  for  higher  s  values.  We  describe 
methods  for  obtaining  good  eigenvalue  estimates  in  practice,  used  for  construction  of 
both  these  bases.  Using  these  more  stable  bases,  we  are  able  to  maintain  stability 
(compared  against  the  standard  implementation)  for  basis  lengths  as  high  as  s  = 
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20.  Note  that  if  the  algorithm  is  communication-bound,  even  basis  length  s  —  2 
theoretically  results  in  2  times  less  communication  than  the  standard  algorithm,  and 
thus  should  yield  a  2x  speedup.  We  also  explore  techniques  such  as  orthogonalization 
of  the  basis,  restarting,  and  deflation. 

•  We  discuss  implementation  details  for  these  methods,  specifically  for  variants  of  the 
matrix  powers  kernel.  Our  algorithms  can  exploit  either  multiple  right-hand-sides 
(RHSs)  or  computations  with  AH  as  well  as  A,  which  both  present  opportunities  for 
data  reuse  in  certain  KSMs.  We  also  discuss  future  opportunities  for  auto-tuning,  such 
as  choosing  a  basis  and  a  basis  size,  which  must  be  selected  to  achieve  both  performance 
and  stability. 

In  all  subsequent  sections  of  this  paper,  we  use  BICG,  CGS,  and  BICGSTAB  to  refer  to  the 
standard  implementations  of  these  KSMs  (as  described  in  [33,  35,  39],  respectively),  and  CA- 
BICG,  CA-CGS,  and  CA-BICGSTAB  to  refer  to  our  communication-avoiding  algorithms. 

2  Related  Work 

There  is  a  wealth  of  work  described  in  the  literature  related  to  s-step  KSMs  and  the  idea  of 
avoiding  communication.  A  thorough  overview  is  given  in  [20],  which  we  summarize  here. 

2.1  Related  Work  in  s— step  Methods 

The  first  instance  of  an  s— step  method  in  the  literature  is  Van  Rosendale’s  conjugate  gradient 
method  [31].  Van  Rosendale’s  implementation  was  motivated  by  exposing  more  parallelism 
using  the  PRAM  model.  Chronopoulous  and  Gear  later  created  an  s— step  GMRES  method 
with  the  goal  of  exposing  more  parallel  optimizations  [9].  Walker  looked  into  s-step  bases 
as  a  method  for  improving  stability  in  GMRES  by  replacing  the  modified  Gram-Schmidt 
orthogonalization  process  with  Householder  QR  [41],  All  these  authors  used  the  monomial 
basis,  and  found  that  convergence  often  could  not  be  guaranteed  for  s  >  5.  It  was  later 
discovered  that  this  behavior  was  due  to  the  inherent  instability  of  the  monomial  basis, 
which  motivated  research  into  the  use  of  other  bases  for  the  Krylov  Subspace. 

Hindmarsh  and  Walker  used  a  scaled  (normalized)  monomial  basis  to  improve  conver¬ 
gence  [19],  but  only  saw  minimal  improvement.  Joubert  and  Carey  implemented  a  scaled 
and  shifted  Chebyshev  basis  which  provided  more  accurate  results  [23].  Bai  et  al.  also  saw 
improved  convergence  using  a  Newton  basis  [1].  Constructing  other  bases  for  the  Krylov 
Subspace  will  be  covered  more  thoroughly  in  Section  4. 

Although  successively  scaling  the  basis  vectors  serves  to  lower  the  condition  number  of 
the  basis  matrix,  hopefully  yielding  convergence  closer  to  that  of  the  standard  method,  this 
computation  reintroduces  the  dependency  we  sought  to  remove,  hindering  communication- 
avoidance.  Hoemmen  resolves  this  problem  using  a  novel  matrix  equilibration  and  balancing 
approach  as  a  preprocessing  step,  which  eliminates  the  need  for  scaled  basis  vectors  [20]. 
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2.2  Related  Work  in  Avoiding  Communication  in  KSMs 

Our  work  is  most  closely  related  to  that  of  [12,  20,  27].  Although  their  efforts  focused  on  Lanc- 
zos,  Arnoldi,  CG  and  GMRES,  the  method  by  which  we  have  derived  our  communication¬ 
avoiding  algorithms  is  closely  related.  In  addition  to  working  with  different  Krylov  Subspace 
Methods  (BICG,  CGS,  and  BICGSTAB),  we  also  make  contributions  in  improving  conver¬ 
gence  and  identify  further  optimizations  for,  and  variants  of,  the  matrix  powers  kernel. 

There  have  been  many  efforts  to  avoid  communication  in  Krylov  Subspace  Methods  in 
the  past  which  differ  from  our  approach.  These  can  be  categorized  as  follows: 

•  Reducing  synchronization  cost 

-  Replacing  Modified  Gram-Schmidt  with  Classical  Gram-Schmidt,  although  con¬ 
sidered  to  be  less  stable,  requires  fewer  synchronization  points  in  methods  which 
require  explicit  orthogonalization  [18]. 

—  Asynchronous  iterations  relax  synchronization  constraints  to  reduce  data  depen¬ 
dencies,  but  lead  to  nondeterministic  behavior  (see,  e.g.,  [8,  5,  6,  13,  42]). 

•  Increasing/Exploiting  Temporal  Locality 

-  Block  Krylov  Methods  can  be  used  to  solve  many  systems  at  a  time  (multiple 
RHSs  for  the  same  matrix  A)  (see,  e.g.,  [10,  38,  15,  28,  26,  2]). 

-  Blocking  Covers  can  be  used  to  reformulate  the  KSM  to  exploit  temporal  locality 
by  blocking  dot  products.  This  work  is  a  direct  precursor  to  our  methods.  The 
same  approach  can  be  applied  to  avoid  communication  in  multigrid  [37,  25]. 

•  Altering  the  Krylov  Method 

—  Chebyshev  Iteration  is  an  iterative  method  (but  not  an  s— step  method)  which 
requires  no  inner  products,  and  only  one  SpMV  per  iteration.  Removing  the  global 
communication  requirement  has  advantages  in  performance,  but  disadvantages  in 
terms  of  convergence,  as  information  can’t  travel  as  quickly  (see,  e.g.,  [4]). 


3  Derivation  of  Algorithms 

We  will  reorganize  three  BICG-like  methods  -  BICG,  CGS,  and  BICGSTAB  -  to  avoid 
communication.  In  exact  arithmetic,  our  communication-avoiding  variants  will  get  the  same 
answer  as  the  original  versions.  In  finite  precision,  our  algorithms  accumulate  different 
rounding  errors,  and  the  iterates  may  diverge  significantly.  Our  communication-avoiding 
implementation  is  based  on  an  s-step  formulation  of  the  original  algorithm.  This  means  we 
explicitly  extend  each  of  the  underlying  Krylov  spaces  by  s  dimensions  and  then  perform  s 
steps  of  Lanczos  biorthogonalization,  or  something  closely  related. 

Each  of  these  steps  normally  requires  O(s)  parallel  synchronization  points  (rounds  of  mes¬ 
sages),  but  we  will  use  communication-avoiding  kernels:  the  matrix  powers  kernel  instead  of 
s  SpMVs  and  computing  a  Gram-like  matrix  to  replace  inner  products.  One  step  of  the  s-step 
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method  now  involves  only  two  synchronizations.  We  say  this  approach  is  communication¬ 
avoiding  because  we  reduce  the  latency  cost  asymptotically,  by  a  factor  of  0(s).  However, 
the  ability  to  avoid  communication  in  the  sparse  portion  is  constrained  by  the  structure  of 
As:  for  example,  if  As ,  s  >  1  has  diameter  s  —  1,  one  parallel  processor  or  cache  block  will 
need  a  copy  of  the  entire  matrix  and  source  vector.  Except  for  block  diagonal  matrices  and 
assuming,  as  always,  no  cancellation  in  powers  of  A ,  we  expect  s  to  remain  modest,  say 
less  than  30.  Thus  in  practice,  the  constants  in  our  big-0  notation  may  be  relevant.  The 
additional  costs  of  the  communication-avoiding  biorthogonalization  kernel  grow  slowly  with 
s,  exceeding  the  standard  algorithm  by  factors  of  O(s)  more  flops  and  words  moved. 

The  main  concern  with  s-step  methods  is  how  to  compute  the  s-step  basis  stably  in 
practice.  The  sequence  of  vectors  (x,  Ax,  A2x, . . .),  which  is  the  monomial  basis  of  a  Krylov 
space,  typically  becomes  linearly  dependent  in  finite  precision  faster  than  in  exact  arithmetic 
(when  it  does  so  always  before  An+lx ).  This  happens  when  the  sequence  converges  to  an 
eigenvector  of  A ,  as  in  the  power  method.  We  cannot  hope  to  find  s  basis  vectors  of  a  space 
of  dimension  less  than  s,  yet  this  is  exactly  what  we  may  ask  our  algorithm  to  find. 

If  A  is  normal,  we  can  use  spectral  information  to  find  Newton  and  Chebyshev  polynomial 
bases  that  are  linearly  independent  for  greater  s  values.  In  practice,  we  expect  matrices 
that  are  roughly  normal,  i.e.,  relatively  few  defective  eigenvalues,  to  also  benefit  from  these 
polynomials.  Unfortunately,  when  A  is  highly  nonnormal,  the  analogy  with  polynomial 
interpolation  breaks  down  and  we  can  no  longer  bound  the  condition  number  of  our  s- 
step  bases  by  the  condition  number  of  a  polynomial  defined  on  the  spectrum  of  A  (in  fact, 
convergence  theory  for  the  standard  algorithms  breaks  down  as  well).  Attempts  to  instead 
use  the  field  of  values  of  A  or  the  convex  hull  of  the  spectrum  have  been  unsuccessful  in 
general.  Although  we  only  discuss  the  Newton  and  Chebyshev  polynomials,  our  presentation 
assumes  only  that  a  polynomial  satisfy  a  three-term  recurrence,  which  is  general  enough  for 
all  the  classical  orthogonal  polynomials. 

All  algorithms  we  present  assume  complex-valued  inputs,  and  we  will  not  discuss  the 
straightforward  simplifications  that  arise  for  real  data. 

3.1  Two-term  vs.  three-term  recurrences 

All  of  the  BICG-like  methods  we  consider  originate  with  the  nonsymmetric  Lanczos  process, 
and  so  the  vector  iterates  always  satisfy  two-  or  three-term  recurrences.  In  the  case  of  the 
BICG  algorithm,  we  consider  both  the  BIOMIN  (two  pairs  of  coupled  two-term  recurrences) 
and  the  BIORES  (one  pair  of  uncoupled  three-term  recurrences)  forms. 

In  finite  precision,  the  BICG-like  methods  we  consider  demonstrate  a  deviation  between 
the  computed  residual  and  true  residual.  This  is  because  the  candidate  solution  and  residual 
are  updated  independently,  with  different  rounding  errors.  Usually  this  destroys  convergence: 
as  the  computed  residual  norm  decreases,  the  corrections  to  the  true  residual  become  smaller, 
causing  the  true  residual  to  stagnate  or  worsen.  This  limits  the  maximum  attainable  accuracy 
of  the  method,  which  we  want  to  be  independent  of  the  data.  It  is  demonstrated  in  [17] 
that  this  instability  can  be  worse  for  three-term  recurrences  than  for  two-  term  recurrences. 
We  did  not  generalize  this  error  analysis  to  our  communication-avoiding  formulations,  but 
conjecture  that  the  same  conclusion  holds. 

We  note  two  algorithmic  approaches  to  cope  with  this  problem  of  round  off  accumulation. 
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The  first  approach  is  to  substitute  the  true  residual  rm  —  b  —  Axm  on  iteration  m,  in  place 
of  the  computed  residual.  The  authors  of  [34]  have  formalized  the  problem  of  dynamically 
selecting  m  and  minimizing  the  impact  of  such  a  substitution  on  the  underlying  Lanczos 
process.  For  our  convergence  studies,  we  monitored  the  true  residual  for  stagnation,  chose 
the  corresponding  iterations  m ,  and  reran  the  method  with  these  statically  selected  restart 
indices  m.  A  dynamic  restarting  approach  for  CA-KSMs  is  future  work. 

The  second  algorithmic  approach  is  based  on  the  observation  that  BICG-like  methods 
unnecessarily  restrict  the  Lanczos  vectors  to  be  the  residuals;  this  is  called  enforcing  con¬ 
sistency.  An  inconsistent  formulation  [16]  computes  scalar  multiples  of  the  residuals,  and 
the  flexibility  to  arbitrarily  choose  these  scaling  factors  gives  more  control  over  the  sizes 
of  the  iterates  and  scalar  coefficients,  and  thus  the  bounds  on  round  off  error.  Relaxing 
consistency  also  enables  us  to  avoid  pivot  breakdowns  in  three-term  recurrence  variants  of 
BICG  and  CGS.  The  inconsistent  formulations  of  classical  BIGG  and  CGS  incur  no  more 
communication  or  computation  asymptotically  and  involve  only  slight  modifications  to  the 
consistent  versions.  The  same  is  true  for  communication-avoiding  inconsistent  formulations 
and  we  give  an  example  for  inconsistent  BIORES,  below.  This  preventative  measure  does 
not  eliminate  the  possibility  of  stagnation  -  it  only  attempts  to  postpone  it  -  thus  restarting 
might  still  be  necessary.  For  this  reason,  we  only  considered  restarting  in  our  convergence 
studies  and  consider  inconsistent  formulations  a  future  direction. 

Although  they  have  different  stability  properties,  two-term  and  three-term  recurrence 
variants  have  similar  computation  communication,  and  storage  costs.  This  might  be  sur¬ 
prising:  in  the  case  of  BICG,  the  two-term  variant  BIOMIN  produces  four  Krylov  bases 
while  the  three-term  variant  BIORES  produces  only  two,  yet  both  versions  perform  the 
same  number  of  SpMV  operations  per  iteration.  The  BIOMIN  variant  couples  the  search 
directions  and  the  residuals  in  order  to  advance  four  Krylov  spaces  with  two  SpMV  opera¬ 
tions.  Unfortunately,  in  order  to  avoid  communication,  we  must  break  the  data  dependency 
introduced  by  this  coupling.  Thus,  our  communication-avoiding  BIOMIN  implementation 
explicitly  constructs  all  four  Krylov  bases,  at  the  cost  of  doubling  sparse  flops  and  words 
moved  in  parallel.  Our  communication-avoiding  BIORES  implementation  avoids  doubling 
the  sparse  flops,  but  instead  requires  maintaining  a  number  of  n-vectors  proportional  to  the 
number  of  steps  we  wish  to  take  without  communication  (s). 

We  chose  to  implement  and  experiment  with  two-term  recurrence,  consistent  formulations 
of  BICG,  CGS,  and  BICGSTAB,  in  order  to  compare  convergence  with  MATLAB  BICG, 
CGS,  and  BICGSTAB.  We  discuss  three-term  and  inconsistent  formulations  to  suggest  that 
the  communication-avoiding  algorithm  design  space  offers  flexibility  to  trade  off  stability  and 
performance. 

3.2  Communication-avoiding  biconjugate  gradient  method  (CA-BICG) 

We  present  derivations  for  four  variants  of  communication-avoiding  BICG  (CA-BICG): 

•  CA-BIOMIN  (consistent  version),  Alg.  3 

•  CA-BIOMIN  (inconsistent  version),  Alg.  4 

•  CA-BIORES  (consistent  version),  Alg.  7 
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•  CA-BIORES  (inconsistent  version),  Alg.  8 

The  names  BIOMIN  and  BIORES  refer  to  the  two-term-  and  three-term-recurrence  variants 
of  the  classical  BIGG  method,  which  we  present  in  Algs.  1,  2,  5,  and  6.  We  referenced  [16] 
for  these  algorithms,  as  well  as  the  naming  conventions. 

Our  numerical  experiments  were  performed  with  consistent  CA-BIOMIN,  Alg.  3,  which 
is  closest  to  what  is  implemented  in  MATLAB  BIGG.  The  other  variants  have  different 
stability  properties  and  performance  characteristics,  and  we  present  them  as  directions  for 
future  work. 


3.2.1  Consistent  and  inconsistent  BIOMIN 

We  present  two  versions  of  two-term  classical  BICG.  [16]  refers  to  this  as  algorithm  as 
BIOMIN,  and  we  will  do  the  same.  The  standard  version,  as  in  MATLAB  BICG  is  due  to 
[14],  and  is  consistent  -  see  Alg.  1  .  We  also  present  an  inconsistent  variant  in  Alg.  2. 


Algorithm  1  Consistent  BIOMIN 


Require:  Initial  approximation  xq  for  solving  Ax  =  b. 


1: 

2: 

3 

4: 

5: 

6 

7: 

8: 

9: 

10: 

11 

12: 

13 

14: 

15: 

16 

17: 

18: 


Compute  p0  =  r0  =  (b  —  Ax 0). 

Choose  p0  =  f0  so  that  <50  =  ufio  7^  0  and  d'0  =  PqApq  ^  0. 
repeat  for  m  —  0, 1, . . ., 

Set  5'm  =  p^Apm.  If  5'm  =  0,  declare  pivot  breakdown  and  STOP. 
Take  steps  along  search  directions: 

^ m  bynj  drn- 
T m+1  ^ m.Apm ■ 

fm+ 1  =  An  -  UXAAHpm. 

Update  candidate  solution:  xm+i  =  xm  +  ujmpm. 

Set  Sm+ 1  =  f^+1rm+1.  If  Sm+ 1  =  0,  STOP. 

If  rm+ 1  =  0,  terminate  with  xex  =  xm+\. 

Otherwise,  declare  Lanczos  breakdown. 

Update  search  directions: 

1pm  An+l/ dm- 

Pm+ 1  =  f- m+1  IpmPm- 
Pm+ 1  I’m+l  IpmPm- 


until 


r  m+1 1 


is  small. 


Terminate  with  approximate  solution  x  ~  xm+i 


3.2.2  CA-BIOMIN 


We  first  consider  the  consistent  BIOMIN  algorithm  in  Alg.  1.  Starting  at  iteration  m,  we 
must  identify  the  data  dependencies  through  the  end  of  iteration  m  +  s  —  1.  These  data 
dependencies  are  the  four  s-step  bases 


[P,R] 

Ad 


[Pmj  Apm i  ■  ■  ■ ;  A  pm ,  r m,  Ar m, ^  A  rm] 

\pm,  AHpm, ...,  (AHY  pm,  rm,  AHfm , . . . ,  (AHy  rm] 
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Algorithm  2  Inconsistent  BIOMIN 

Require:  Initial  approximation  xo  for  solving  Ax  =  b. 

1:  Compute  po  =  r0  —  (b  —  Ax o)  /7_i  for  some  7_i  7^  0,  e.g.  so  ||r0||  =  1. 
2:  Redefine  xq  =  £0/7-1  ■ 

3:  Choose  po  =  r0  so  that  <50  =  f(f  r0  0  0  and  d'0  =  PoAp0  7^  0. 

4:  repeat  for  m  —  0, 1, . . 

5:  Set  5^  =  p^Apm.  If  5'm  =  0,  declare  pivot  breakdown  and  STOP. 

6:  Take  steps  along  search  directions: 

7:  0  m  d m /  Cn  • 

8:  Choose  7m  0  0  and  7m  0  0  arbitrarily. 

9:  ^m+l  (0m  (frmApm )  /7m- 

10:  r, m+1  =  (f. m  -  (f)mAHpm )  /7m- 

11:  7Tm+l  =  71"m0m/7mi  with  7To  —  1/7—1- 

12:  Update  candidate  solution:  £m+i  =  —  (0m^m  +  Pm)  /7m- 

13:  Set  5m+l  =  'fm+Rm+i-  If  <Wl  =  0,  STOP. 

14:  If  rm+i  =  0,  terminate  with  £ex  =  xm+i/7tm+i. 

15:  Otherwise,  declare  Lanczos  breakdown. 

16:  Update  search  directions: 

1 1 :  0m  =  7n0m+l/  §jrv 

18:  0m  =  7*)0m+l/  ^m- 

19-  Pm+l  ^m+ 1  0mPm- 

20:  Pm+l  f’m+l  0mPm- 

21:  until  ||rm+i||  /  |7rm+i|  is  small. 

22:  Terminate  with  approximate  solution  x  ~  im+i/^m+i- 
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and  the  scalar  coefficient  8m.  The  columns  of  these  matrices  span  Krylov  spaces.  We 
compute  s-step  bases  of  these  Krylov  spaces,  P,  R,  P,  R  ,  calling  the  matrix  powers  kernel. 
As  previously  mentioned,  we  might  choose  to  use  a  different  polynomial  basis  besides  the 
monomial  basis  for  stability  reasons  -  we  have  decorated  the  Akx  bases  with  carats.  For  ease 
of  presentation,  assume  we  use  the  same  polynomials  Pj(z),  0  <  j  <  s  for  all  four  bases,  and 
that  pj  has  nonzero  leading  coefficient  so  that  deg pj  =  j.  For  example,  P(’-,j)  =  Pj+i  ( A)  pm 

and  R(:,j)  =  Pj+ 1  ( AH )  fm.  Then  we  can  find  an  upper  triangular  s  +  lxs  +  1  change  of 
basis  matrix  B ,  to  recover  the  monomial  (or  Krylov)  bases,  according  to 

P,  R,  P,  R  •  hA  C3>  B  =  P,  R,  P ,  R 

We  discuss  the  choice  of  polynomials  and  computing  such  a  matrix  B  later. 

Since  all  iterates  in  iterations  m  :  m  +  s  can  be  expressed  as  a  linear  combination  of 
the  vectors  in  the  s-step  bases,  we  can  run  nonsymmetric  Lanczos  symbolically  with  the 
coefficient  vectors,  for  j  =  0  :  s  and  k  —  0  :  s  —  j, 

A  r]  ck3  =  Akr m+j  [A  A]  dk  =  ( AH ) fm+j 

p,  r\  A  =  Akpm+j  [A  r]  bk  =  ( AH ) pm+j 

to  represent  the  iterates  locally/in  fast  memory.  This  also  lets  us  compute  the  inner  products, 
although  not  symbolically  -  the  inner  products  are  encoded  in  the  entries  of  the  Gram-like 

r  £  --I  h  r  „ 

matrix  G  =  P,  R  P,P  and  recoverable  using  the  coefficient  vectors. 

At  this  point  we  simply  state  the  algorithm,  in  Alg.  3.  It  is  a  purely  mechanical  process 
to  generalize  this  result  to  the  inconsistent  BIOMIN  formulation  (Alg.  2),  which  we  present 
in  Alg.  4. 

3.2.3  Consistent  and  inconsistent  BIORES 

We  present  two  versions  of  three-term  BIGG  in  Algs.  5  and  6.  Gutknecht,  in  [16]  refers 
to  this  as  algorithm  as  BIORES,  and  we  will  do  the  same.  The  difference  between  the 
consistent  and  inconsistent  versions  is  the  same  as  with  BIOMIN,  except  additionally,  the 
consistent  version  is  able  to  avoid  the  pivot  breakdown  condition,  which  means  a  division 
by  zero  when  vanishes  or  perhaps  overflow  when  nearly  vanishes.  Pivot  breakdown 
is  due  to  a  zero  pivot,  when  (implicitly)  performing  Gaussian  elimination  on  the  (implicit) 
tridiagonal  matrix,  and  is  easily  avoided.  As  before,  the  additional  cost  is  negligible  -  we 
simply  maintain  an  extra  scalar  nm. 

[16]  also  presents  BIODIR,  another  3-term  variant,  which  constructs  a  set  of  biconjugate 
rather  than  biorthogonal  bases.  Biconjugate  means  biorthogonal  with  respect  to  the  A 
inner  product  -  because  A  is  not  necessarily  SPD,  this  is  a  formal  inner  product.  This 
variant  has  the  advantage  of  avoiding  Lanczos  breakdown  (by  stalling,  perhaps  indefinitely, 
in  the  case  of  an  incurable  breakdown),  but  is  still  susceptible  to  pivot  breakdown  unless  an 
inconsistent  formulation  is  applied.  BIODIR  is  part  of  a  large  class  of  look-ahead  Lanczos 
methods,  which  are  designed  to  overcome  Lanczos  breakdowns  and  near  breakdowns.  A 
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Algorithm  3  Communication-avoiding  consistent  BIOMIN 
Require:  Initial  approximation  xo  for  solving  Ax  =  b. 

1:  Compute  po  =  r0  =  (6  —  Axo). 

2:  Choose  po  =  r0  so  that  <50  =  rjfr0  ^  0  and  <5'0  =  PoAp0  ^  0. 

3:  repeat  for  m  =  0, 1, . . 

4:  Extend  the  four  Krylov  bases  by  s  dimensions,  using  the  matrix  powers  kernel  (Akx). 

5:  P,R  ,B  =  Akx  (A,  s,  \pm,rm\),  P,R  ,  B  =  Ak  x  (AH ,  s,  \pm,  rm\) . 


:  Compute  G  = 

_ i 

h  r „ 

P,  R 

L.  _l  _l 

:  Initialize  coefficient  vectors 

i-  .  “i 

r_o  ^1 

0s+l,s+l 

:  [c0>  co>  •  •  • 

coJ  = 

B 

K  ■  ■  ■ )  ^o.  ~ 


0s+l,s+l 

B 


K,  ah,  ■  ■  ■ ,  °o]  — 
K  tf),  •  ■  ■ ,  6g]  = 


B 

0s+l,s+l 

B 

0s+l,s  +  l 


02s+2,1 

1 


for  j  —  0  :  s  —  1,  do 

fi'm+j  =  (b(j)  Gad.  If  S'm+j  =  0,  declare  pivot  breakdown  and  STOP. 
Take  steps  along  search  directions: 

um+j  —  dm+j /bm+j  ■ 

ck+l  =  ck  -  um+jak+l ,  dkj+ 1  =  dki-  UPR~bk+1,  for  k  =  0  :  s  -  j  - 


Jm+ju  j 


bk+1 ,  for  k  —  0  :  s  —  j  —  1. 


Update  candidate  solution:  eJ+i  =  e3  +  um+j  . 
Set  8m+j+1  =  (d°j+1)H  Gc°j+1.  If  Sm+j+l  =  0,  STOP. 


If  P,R  c°j+1  =  0,  terminate  with  xex  =  P,R,xm  eJ+ 1. 
Otherwise,  declare  Lanczos  breakdown. 

Update  search  directions: 

Pm+j  dmjrjjr\/8mjrj. 

aj+i  =  cj+1  -  Pm+jtpji  bkj+ 1  =  dkj+l  -  if; m+jbj ,  for  k  =  0  :  s  -  j  -  1. 

end  for 

Recover  iterates  from  last  inner  iteration. 


T  m+s  — 


P,  R  c°,  pm+s  =  P,R  a°s, 

R,  R  dg ,  Pm- |-s  P,  R  bg,  ■t'ni  ■  .s  P i  R)  %m  &s- 

is  sma  1. 


until  ||rm+s||  is  small. 

Terminate  with  approximate  solution  x  ~  xm+s. 
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Algorithm  4  Communication-avoiding  inconsistent  BIOMIN 
Require:  Initial  approximation  xq  for  solving  Ax  =  b. 

1:  Compute  po  =  r0  =  (6  —  Ax o)  /7_i  for  some  7_i  p  0,  e.g.,  so  1 1 r0 j  |  =  1. 

2:  Redefine  xq  =  £0/7-1  ■ 

3:  Choose  po  =  U)  so  that  <50  =  f(/r0  7^  0  and  <5'0  =  pPAp0  p  0. 

4:  repeat  for  m  =  0, 1, . . 

5:  Extend  the  four  Krylov  bases  by  s  dimensions,  using  the  matrix  powers  kernel  (Akx). 

6:  P,R  ,B  =  Ak  x(A,  s,  \pm,rm\),  P,R  ,B  =  Akx  (AH,  s,  \pm,  fm]) . 


:  Compute  (S'  = 

- 1  L 

b3*> 

_ i 

h  r „ 
P,R 

i-  _ii_  _i 

:  Initialize  coefficient  vectors 

r  „  “i 

rj  ^1 

Us+l,s+l 

:  iC0’  co>  •  •  • 

co\  = 

B 

Woi-"!  ^()]  ~ 


0s+l,s+l 

B 


[Oq,  Oq,  .  .  .  ,  Oq]  —  ( 

[&8, b^, ...  ,bo\  =  ( 


B 

0s+l,s+l 

B 

bs  +  l,s+l 


02s+2,l 

1 


for  j  —  0  :  s  —  1,  do 

b'm+j  =  (p°j)  Gdj.  If  8'm+j  =  0,  declare  pivot  breakdown  and  STOP. 
Take  steps  along  search  directions: 

Pm+j  =  ^m+j/^rn+j- 

Choose  7 m+j  p  0  and  7 'm+j  p  0  arbitrarily. 

ckj+1  =  (ck  -  pm+jak+1)  hm+j,  dkj+ 1  =  (dk  -  pm+jbk+l)  /pm+j. 

s-j-1. 

^rn+j + 1  ^m+jPm+j / rtm+j  t  with  7Tq  1  / 7 —  1  * 

(  raoi  \ 

Update  candidate  solution:  eJ+i  =  —  I  pm+jej  +  gJ  J  h m+j- 
Set  5m+j+1  =  (d°j+1)H  Gc°j+1.  If  5m+j+1  =  0,  STOP. 


*)  /t m+j,  for  k  =  0  : 


P,R 


c°+1  =  0,  terminate  with  xex  = 


^  -P, -R,  £rn  7/  '  1  /T m+j+1 • 


Otherwise,  declare  Lanczos  breakdown. 

Update  search  directions: 

P m+j  —  lm+jdm+j+l/ dm+j,  Pm+j  —  'Im+j^m+j+l/ bm+j- 

ak+1  =  ck+1  -  Pm+jakj ,  bk+ 1  =  dk+1  -  pm+jbk ,  for  k  =  0  :  s  —  j  -  1. 

end  for 


27: 

T'm+s  P  •>  R 

Cs,  Pm+s 

28: 

T’mp-s  P,  R 

ds,  Pm+s  — 

P,R 

29:  until 

ll^m+sll  /  |^"m+s 

is  small. 

3?ra+s 


-P  5  -R?  %m 


Terminate  with  approximate  solution  x  «  xm+s/7rm+s. 
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communication-avoiding  BIODIR  formulation  can  be  derived  in  a  similar  way  to  BIORES 
and  has  similar  costs  except  requires  the  storage  of  an  extra  iterate.  We  conjecture  that  our 
communication-avoiding  approach  will  generalize  to  other  look-ahead  Lanczos  methods;  this 
is  future  work. 


Algorithm  5  Consistent  BIORES 

Require:  Initial  approximation  x0  for  solving  Ax  =  b. 

1:  Compute  ro  —  (b  —  Ax o). 

2:  Choose  f0  so  that  <50  =  r(fr0  ^  0. 

3:  repeat  for  m  —  0, 1, , ,. ., 

4.  Set  cem  rmAr m/dm  and  oim 

o:  Set  Pm—1  7m— ]Am/^m— 1  and  Pm—1  Pm— Aim—  1  /r)m—  1 )  with  P—\ 

P-i  =  0. 

6:  Set  7 m  =  am  —  Pm- 1-  If  7m  =  0,  declare  pivot  breakdown  and  STOP. 

7:  Choose  7m  7^  0  arbitrarily. 

8-  l'm+ 1  ( Arm  Om,r m  /^m— Rm— l)  lrlm 

9-  Cn+1  ^  m  ^rArn  Pm—  Am—  1^  /'tm 

10:  Xm_)_x  (rm  T  OimXm  T  Pm—lXm—l)  /7m- 

11:  5m+i  =  r^+1rm+1.  If  dm+i  =  0,  STOP. 

12:  If  rm+1  =  0,  terminate  with  rcex  =  xm+i.  Otherwise, 

13:  If  fm+i  7^  0,  declare  Lanczos  breakdown ; 

14:  If  rm+i  =  0,  declare  left  termination. 

15:  until  ||rm+i]|  is  small. 

16:  Terminate  with  approximate  solution  x  ~  xm+i. 


3.2.4  CA-BIORES 

The  derivation  here  seems  more  complicated  than  CA-BIOMIN  but  the  difference  is  super¬ 
ficial,  purely  due  to  the  fact  that  we  represent  the  iterates  partially  in  terms  of  the  s-step 
history 


Re-1  f  m— s+l  i  T  m— s+2 ) 


i  and  Rf>—\  rm—s+li  I’m— s+2) 


where  i  indexes  the  outer  iterations,  and  m  =  is  +  j ,  where  j  indexes  the  inner  iterations  for 
the  current  outer  iterations.  We  combine  the  s-step  history  with  the  s-step  bases  generated 
from  rm  and  rm,  as 


R 

R 


[Re- i(b  1  :  s),Po  {A)rmip1  (A)rm,  ...,ps(A)  rm\ 

Re-  i(:,  1  :  s),  po  ( AH )  rm,  p,  (AH)  rm, ps  ( AH )  rm 
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Algorithm  6  Inconsistent  BIORES 

Require:  Initial  approximation  xq  for  solving  Ax  =  b. 

1:  Compute  r0  =  (b  —  Ax0 )  /7_i  for  some  7_i  ^  0,  e.g.,  so  ||r0||  =  1. 

2:  Redefine  x0  =  £0/7-1. 

3:  Choose  r0  so  that  <50  =  r//r0  0. 

4:  repeat  for  m  —  0, 1, . . 

5;  Set  o m  r mAr mf 6m,  and  o otm. 

6:  Set  Pm—  1  7m— 1  &ad  Pm— i  7m-l^m/^m-l  Pm—llfm—l/P/m—li  with  /5 —  i 

/3_i  =  0. 

7:  Choose  7m/0  and  7m  7^  0  arbitrarily. 

8:  7I"m+l  =  (ttm^m  T  Pm—  1  ^m—  1 )  /7m.  with  7To  —  1/7—1- 

9;  ‘V m+ 1  =  (A'C m  OimT m  Pm-l  'r m—l )  / 7m 

10:  r ’TO+1  =  ^A  fm  Om'C m  Pm—\f m_i^  /f/m- 

11:  £m_)_l  =  ('Cm  +  OtmXm  T  Pm—  \Xm-l)  /' /m ■ 

12:  5m+i  =  r^+1rm+1.  If  Sm+ 1  =  0,  STOP. 

13:  If  rm+i  =  0,  terminate  with  £ex  =  £m+i/7rm+i .  Otherwise, 

14:  If  rm+i  7^  0,  declare  Lanczos  breakdown ; 

15:  If  rm+i  =  0,  declare  left  termination. 

16:  until  ||rm+i||  /  |7rm+i|  is  small. 

17:  Terminate  with  approximate  solution  x  ~  £m+i/7rm+i. 


where  pj  ( z )  and  pj  (z)  are  polynomials  of  (exact)  degree  j.  The  matrix  powers  kernel  also 
provides  change-of-basis  matrices  B  and  B  to  perform  the  transformations 


R  := 
R  :  = 


R,S  0s,s+l 
0s+i,s  B 


R 


R,s 

0s+l,s 


0s,s+l 

B 


J"m—s+ 1)  ■  •  •  i  I’m—  i ,rm,Arm,  •  •  •  ,Asrm ] 


I’m— s+1)  ■  ■  ■  ■  I’m— 1)  I’m;  A  Cm,  •  •  •  ,  (A  )  Cm 


We  will  apply  these  transformations  implicitly.  Note  that  the  initial  conditions  /3_i  =  P- 1  = 
0  truncate  the  three-term  recurrence  before  m  =  0  and  allow  us  to  define  77  =  77  =  0n;i  when 
i  <  0,  and  r0  and  r0  are  readily  available  from  the  input  data.  We  assume  m  >  0. 

Our  goal  is  to  compute 

R(.  [On+l>  l’m+2)  •  •  •  7  ^m+s]  and  Rg  [^m+1)  Cn+2)  •  •  •  )  ^m+s] 

from  the  s-step  history  and  s-step  bases.  Examine  the  residual  update  formulas: 
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manipulated  to  become 


^4  rm+j+i  \ 


{AH)kfm+j+ 1  =  l 


A '  vrnjrj  A  '  [r //-/+/_  i  • 

A  Ifm+j  j  ^m+j+1  j  fm+j+2\ 

(A^)A+1rm+i  -  [f m+j-l,  r. ’m+j] 

(y4  )  [r m+j ,  f m+j+l,  An+j+2] 


P  m+j — 1 
ft  m+j 

~&m+j+ 1 
Tra+j+l 


Pm+j—1 

A-m+j 


Pm+j 

"Om+j+l 

Tro+j+l 


and  introduce  the  coefficient  vectors 

Rck  =  Akrm+j  and  Rdk  =  ( AH ) fm+J 

As  with  BIOMIN,  we  substitute  these  coefficient  vectors  into  the  recurrences  for  the  left  and 
right  residuals.  This  step  is  more  involved  here  so  we  provide  an  intermediate  result  before 
stating  the  algorithm.  Provided  the  Lanczos  coefficients  are  available  (we  will  show  how  to 
compute  them  shortly),  the  constraints  —s<j<s  and  0  <  k  <  s  allow  us  to  express  the 
left  and  right  residuals  as 

|  Pm+j—  1 

^ m+j 


^+1  [  k  nk 

cj  Lcj-i’  cji 


ct+i  =  < 


dk  —  l 

aj+i  ~  < 


Os,i 

B(:,k  +  1) 


[A-l  rk- 1  rk- 11 

L  cj  ’S+i’S+2j 

Is,s('-1  s  +  j  +  1) 

Os+i,i 

dj+1-  [dJ-l>dJ] 


Os,i 

B(:,k  +  1) 


rjfc-i  jfc-i  jfe-i] 
."j  l-"./-2l 

s  +  j  +  1) 

Os+i,i 


/  Ifm+j  j  1 

J  =  “I 


Pm+j 

—  (Xm+j+1 

lfm+j+1 


j  <  —1,  k  >  0 

J  <  -l,fc  =  0 


Pm+j — 1 

^  /  Pm+j  j  A>  1 

J  =  “I 

—  Pm+j 

®m+j + 1 

j  <  -1,  k  >  0 

Tm+j+1 

j  <  -1,  k  =  0 
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We  avoid  the  inner  products  in  the  same  way  as  CA-BIOMIN,  by  the  use  of  the  Gram-like 

matrix  G  =  RH R.  The  rest  of  the  computations  in  BIORES  require  no  communication. 
We  present  both  consistent  and  inconsistent  versions  of  communication-avoiding  BIORES 
together  in  Algs.  7  and  8. 

3.3  Communication-avoiding  conjugate  gradient  squared  method 
(CA-CGS) 

The  multiplication  by  AH  and  corresponding  left  Krylov  vectors  (residuals  f)  only  contribute 
to  the  BIGG  solution  of  Ax  =  b  through  the  scalar  coefficients.  BIGG  does  not  exploit  the 
reduction  in  magnitude  of  the  left  residuals  unless  we  solve  a  dual  system.  In  an  effort  to 
get  faster  convergence,  or  when  AH  might  not  be  available,  for  instance  when  the  linear 
system  A  is  the  Jacobian  in  a  (matrix-free)  Newton-Ivrylov  method,  one  might  demand 
a  transpose-free  method  such  as  CGS  (conjugate  gradient  squared,  [35]),  a  QOR  (quasi- 
orthogonal  residual)  method  derived  from  BIGG  (another  QOR  method),  that  avoids  the 
multiplication  by  AH .  CGS  respects  the  mutual  biorthogonality  of  the  two  Krylov  spaces 
(and  so  the  Lanczos  coefficients  are  the  same  as  BICG,  in  exact  arithmetic),  however,  the 
polynomials  representing  the  CGS  residuals  are  the  squares  of  those  in  BICG.  These  squared 
residuals  are  required  to  avoid  keeping  track  of  the  left  Lanczos  vectors.  However,  CGS 
actually  interprets  the  squared  BICG  residuals  as  the  true  residuals,  and  updates  the  solution 
accordingly.  This  heuristic  decision  was  motivated  by  the  observation  that  the  BICG  residual 
polynomials  typically  reduce  the  norm  of  the  starting  vector,  and  so  one  would  hope  that 
applying  the  BICG  polynomial  again  (to  an  already-reduced  residual)  might  reduce  it  further. 
But  because  it  squares  the  polynomials,  CGS  might  have  more  irregular  convergence  than 
BICG.  As  a  side  effect,  larger  intermediate  quantities  in  CGS  could  worsen  local  round  off, 
leading  to  a  (faster)  deviation  between  computed  and  true  residuals. 

3.3.1  BIOMINS 

We  consider  the  BIOMINS  form  of  CGS,  the  original  from  variant  [35],  presented  in  Alg. 
9.  Alg.  9  has  been  slightly  reorganized  to  absorb  his  auxiliary  quantities  u  and  v.  Some 
notation  changed  from  his:  (a,  (3,  p,cr,q)  — *  (cu,  i/j,  5,  5',  s).  Our  notation  also  borrows  from 
§7.4.1  of  [33]  and  §14  of  [16], 

3.3.2  CA-BIOMINS 

Of  the  four  vector  iterates  in  Alg.  9,  three  are  explicitly  multiplied  by  A.  The  standard 
version  in  [35]  only  performs  two  actual  SpMV  operations  per  iteration.  Our  CA  imple¬ 
mentations  will  involve  a  call  to  Akx  with  three  right-hand  sides,  and  to  the  2s  power,  to 
replace  s  iterations.  This  is  a  3x  increase  in  sparse  flops,  more  depending  on  redundant 
flops.  We  note  that  a  CA-BIORESS  (three-term  CGS)  implementation  would  only  involve 
two  right-hand  sides,  in  exchange  for  maintaining  a  2s-step  history.  Furthermore,  we  can 
derive  an  inconsistent  version  of  CA-BIORESS  with  the  same  advantages  as  inconsistent 
BIORES.  However,  we  believe  that  CA-BICGSTAB  is  a  better  alternative  than  any  of  these 
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Algorithm  7  Communication-avoiding  consistent  BIORES 
Require:  Initial  approximation  rr0  for  solving  Ax  =  b.  Compute  r0  —  (b  —  Axq). 
l:  Choose  f0  so  that  S0  =  rjfr0  7^  0. 

2:  Base  case:  R_  1  =  0n;S  and  77_i  =  0n>s,  rr_i  =  0nii 

3:  Base  case:  1  =  /3,_i  =  a*  =  d;  =  5*  =  %  =  %  =  0  for  i  =  —  s  :  —  1 

4:  repeat  for  t  —  0, 1, 2  . . .  and  m  =  s7, 


Set  7?  0niS+i]  and  7?  1>  0n;S_|_i  . 

77(:,  s  +  1  :  2s  +  1),  B  =  Akx  (A,  s,  rm),  7?(:,  s  +  1  :  2s  +  1),  B 
Akx  (AH,  s,  fm). 

Compute  G  =  77^77. 

rr.°  ro  r°  1  -  Is,s  \d°  d°  <7°  1  -  Is,s 

Lc-s?  c-s+i?  •  •  •  j  c-iJ  —  n  ’  L  -s’-s+i’  *  •  •  ’a-iJ  —  n 

Us+l,sJ  L^S+ljS 


for  j  =  —s  :  —2,  do 

for  k  —  l:s  +  j  +  ldo 


fim+j 


fim+j 


Jz  _  \  k—\  k — 1  k— 11  _ „  Jk  _  r/fc-l  jk— 1  j/c— 1]  ~ 

°j+l  —  |_°j  ^m+j+l  ?  aj+l  —  7  aj+l  ’  ai+2j  ^m+j+l 


end  for 
end  for 

rro  1  s]  _ 

Lco>  c05  -  -  -  5  coJ  —  ^ 


Ym+j+ 1 


3  +1  +5]  _ 

D>  u0>  '  '  '  5  uoJ  — 


Tm+j+l 


0s,s+l 

B 


e~i  = 


02s+2,l 

1 


,  Co  — 


02s+l,l 

1 

0 


for  j  —  0  :  s  —  1,  do 

Set  o  rn . j  ( dj  j  TjCj  and  o.m+j  o  rn  .  3 . 

Set  1  7wh-j— l^m+j /bm+j— 1 1  with  /?_  1  0. 

Set  7rn+j_  1  ^m+j /^m+j_  1  fim+j—l'Ym+j—l/'Ym+j—li  with  /5 —  i  0. 

Set  7m+j  =  am+j  —  Pm+j- 1-  If  'Ym+j  —  0,  declare  pivot  break, down  and  STOP. 
Choose  7m+j  7^  0  arbitrarily. 

cj+1  =  (cf1  -  [cj_i,  cj]  ^  /lm+j,  for  k  =  0  :  s  -  j  -  1. 

d*+i  =  [djr\  dJ+J,  djg]  —  dm+j+i  ,  for  k  =  0  :  s  -  j  -  1. 

Tm+j'+l 


J  +[ej_i,ej] 
02,1 


Set  <Sm+j+1  =  (ii“+1)  Gc§+1.  If  =  0,  STOP. 

If  7?c°+1  =  0,  terminate  with  xex  =  R,xm_i,xm  e3+\. 

If  Rdj. |_i  jtz  o,  declare  Lanczos  breakdown ; 

If  RdJ+\  =  0,  declare  left  termination. 

end  for 

l'm+.s  7?CS,  1'm+s  Rdg ,  23m_|_s  R^Xm—i^Xm  €s. 
until  ||rm+s||  is  small. 

Terminate  with  approximate  solution  re  ~  rcm+s. 


Algorithm  8  Communication-avoiding  inconsistent  BIORES 
Require:  Initial  approximation  x0  for  solving  Ax  =  b. 

1:  Compute  r0  =  (b  —  Ax0 )  /7_i  for  some  7_i  7^  0,  e.g..  so  ||r0||  =  1. 

2:  Redefine  x0  =  rr0/7-i-  Choose  f0  so  that  d0  =  r^r0  7^  0. 

3:  Base  case:  RL 1  =  0n;S  and  R_i  =  0njS,a;_i  =  0n>i 

4:  Base  case:  /3j_i  =  1  =  a*  =  a*  =  5*  =  7^1  =  77-1  =  77  =  0  for  i  =  —  s  :  —  1 

5:  repeat  for  i  —  0, 1, 2  . . .  and  m  =  s£, 

6:  Set  R  [Re— 1?  0n,s+i]  and  R  Ra— 1?  0n,s+i  . 

7:  R(:,  s  +  1  :  2s  +  1),  B  =  Ak x(A,s,  rm),  R(:,  s  +  1  :  2s  +  1),  B 

Akx  (AH,  s ,  rm). 

8:  Compute  G  =  R^R. 


!r°  r°  r°  1  =  ^s,s  \d°  d°  d°  1  = 

—  S’  —  S+l’  •  •  •  5  lj  f)  ’  La— S’  a— s+l’  *  *  *  ’  U/— lj  r 

[_us+l,sJ 


for  j  =  —s  :  —2,  do 

for  =  1  :  s+j  +  1  do 

Prn+j 

Pk  _  r_fc-i  _fc-i  _fc-ii  i 

C+i  —  LLi  ’  Li+i  ’  S+2J  u»n+j+i 


end  for 
end  for 

[co>  co>  ■  •  • >  co]  = 


for  j  =  0  :  s  —  1,  do 


lfm+j+1 


,  [rfg,  dg,  .  .  .  ,  rfg  — 


fim+j 

Xk  _  [jfe-1  jfc-1  jfc-1]  ~ 

CC+1  —  LUJ  )  aj+li  aj+2j  “m+j+l 

7m+j+l 

n  1  Tn  1  02s+i,i 

us,s+l  U2s+2,1  n 

R  J  ’  e_1  =  [  1  J  >  e°  =  0 


,  e_i  = 


O2S+27 

1 


)  e0  — 


Tn+y  ( d?  j  GCj,  (Xm+j  (-^in+j '  fim+j— 1  7to+j  —  l^m+j/^m+j—  li  with  (3  |  0. 

Set  i  7m+,y  —  1  b>nf dm+j -  1  Pm+j—l^Ym+j—l/'Ym+j—li  with  /3_i  0. 

Choose  7m+j  7^  0  and  7m+J  7^  0  arbitrarily. 

^’m+j+l  =  "b  fim+j— l^m+j— l)  jr Ym+j  with  7Tq  =  1/7-1- 

cj+i  =  (ci+1  -  [c*-i,  cj]  /3”^1  )  ll m+j,  for  k  =  0  :  s  -  j  -  1. 


4+i  =  [4  17ij+n  4+2 ]  "%tj+i  ,  for  k  =  0  :  s  -  j  -  1. 

Tm+j'+l 

%+i  =  -  (C"  1  +  h-i.ej  [++  Aw 
V  |_U2.lJ  L  am+i  J  J 

Set  <5m+j+1  =  (d°j+1)H  Gc°j+1.  U8m+j+1  =  0,  STOR 

If  Rc°j+1  =  0,  terminate  with  xex  =  ^  R,xm- i,xm  ej+1  j  /nm+j+1. 

If  Rdj+i  7^  0,  declare  Lanczos  breakdown ; 


If  Rd-,+1  =  0,  declare  left  termination. 


end  for 

+m+.s  ^ m+s  Rds,  Xm_|_s 


>  1  ? 


until  llr^ 


is  small. 


Terminate  with  approximate  solution  re  ~  rcm+s/7rm+s. 


Algorithm  9  BIOMINS _ 

Require:  Initial  approximation  x0  for  solving  Ax  =  b. 

1:  Compute  So  =  Po  —  ro  =  b  —  Ax0. 

2:  Choose  f0  arbitrary  such  that  80  =  r(fr0  ^  0  and  <5'0  =  r^Ap0  ^  0. 

3:  Set  'ip-i  =  0  and  s_i  =  0n>s. 

4:  repeat  for  m  —  0,1,... 

5:  Set  5'm  =  rff  Apm.  If  S'm  =  0,  STOP  and  declare  pivot  breakdown. 

6:  Take  steps  along  search  directions: 

7:  M m  8 mf  8m- 

8:  Sm  —  Tm  T  'Ipm—iSm—l  mApm . 

9:  rm+ 1  =  r m  -  2umAr m  -  2u}m^m^1Asm-1  -  UmA2pm. 

10:  Update  candidate  solution: 

11:  xm+i  xm  T  2umr m  T  2um'i^m—ism,—i  T  w mApm ■ 

12:  Set  5m+ 1  =  f^rm+i.  If  <5m+1  =  0,  STOP. 

13:  If  rm+ 1  =  0n>i,  terminate  with  a:ex  =  xm+i- 

14:  Otherwise,  declare  Lanczos  breakdown. 

15:  Update  search  direction: 

16:  ^m+ 1/  ^m- 

17:  Pm+1  =  rm+ 1  +  2^mSm  +  V'mPm- 

18:  until  ||rm+1||  is  small. 

19:  Terminate  with  approximate  solution  x  ~  xm+i. 

methods,  in  terms  of  stability  and  performance,  and  will  not  discuss  any  CCS  variants  other 
than  consistent  BIOMINS. 

We  start  with  the  iterates  (rm,pm,  sm- 1)  given  by  the  initial  data  (m  =  0)  or  from  a 
previous  iteration  (m  >  0).  We  call  the  matrix  powers  kernel  to  compute 

P  \.Po  (^4)  Pmi  Pi  (-'4)  Pirn  •  •  •  i  Ps  (-*4)  Pm] 

P  \P0  (-'4)  I’m;  Pi  (-'4)  I’m;  •  •  •  j  Ps  (-*4)  I’m] 

P  \P0  (-4-)  Sm-li  Pi  (-'4)  Sm—  1)  •  •  •  j  Ps  (-4)  Sm—  l] 

where  pj  ( z )  and  pj  ( z )  are  polynomials  of  (exact)  degree  j.  (In  principle,  we  might  use  a 
different  polynomial  for  each  basis  matrix.)  Akx  also  provides  a  change-of-basis  matrix  B 
that  allows  us  to  convert  back  to  the  Krylov  basis, 


02s+l,2s+l 

P,R,S\  B  =  [rm,Arm,...,A2srm] 

|_02s+l,2s+l_ 


P  P  (I  04s+2,2s+l  _  r  yl  a2s  ] 

i  ,  it,  O  \£>m— li  li  •  •  •  ?  ^m— lj 

although  we  apply  this  transformation  implicitly. 
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Using  these  2s-step  Krylov  bases,  substituting  the  coefficient  vectors,  defined  for  0  <  k  < 


25, 


P,R,S 

P,R,S 

P,R,S 


A  Pm+j 
(AH)krm+j 


and  row  vector  g  =  fijf  P,R,S  ,  we  arrive  at  a  communication-avoiding  version  in  Alg.  10. 


3.4  Communication-avoiding  biconjugate  gradient-stabilized  method 
(CA-BICGSTAB) 

The  CGS  method  reinforced  the  idea  that  the  BICG  algorithm  only  exploits  the  fact  that 
the  right  Krylov  basis  is  biorthogonal  to  the  left  Krylov  basis  -  the  biorthogonality  need  not 
be  mutual.  Lanczos-type  product  methods  (LTPMs)  use  a  different  polynomial  recurrence 
for  the  left  basis  and  combine  this  with  the  CGS  strategy,  applying  this  second  polynomial 
to  the  right  basis  in  order  to  avoid  computing  the  left  basis  at  all.  The  hope  is  that  the 
left  polynomial  further  reduces  the  residual.  Many  LTPMs,  including  BICGSTAB  and  its 
variants,  choose  the  left  polynomial  to  have  smoothing  or  stabilizing  properties.  As  with 
Lanczos  (versus,  e.g.,  Arnoldi),  a  shorter  polynomial  recurrence  is  preferable  for  performance 
reasons. 

In  the  case  of  BICGSTAB,  the  left  polynomial  takes  a  two-term  recurrence,  and  amounts 
to  extending  the  Krylov  space  by  one  dimension  (a  new  basis  vector),  and  taking  a  steepest 
descent  step  in  that  direction  (line  search).  This  is  a  local  one-dimensional  minimization, 
which  should  result  in  a  smoother  convergence  curve  and  avoid  possible  overflow  conditions 
in  CGS.  However,  an  issue  with  BICGSTAB  is  that  if  the  input  data  is  all  real,  the  stabilizing 
polynomial  will  have  only  real  zeros.  Such  a  polynomial  will  not  reduce  error  components  in 
the  direction  of  eigenvectors  corresponding  to  eigenvalues  with  large  imaginary  components 
(relative  to  their  real  components).  Matrices  with  such  a  spectrum  are  also  more  susceptible 
to  a  minimization  breakdown  in  BICGSTAB,  a  new  breakdown  condition. 

These  two  drawbacks  to  BICGSTAB  are  addressed  in  the  literature  by  many  newer 
LTPMs  by  using  (at  least)  two-dimensional  residual  smoothing.  Other  smoothers,  like 
Chebyshev  polynomials,  have  also  been  considered.  In  this  work,  we  only  demonstrate 
our  communication-avoiding  approach  on  the  simplest  LTPMs  (CGS  and  BICGSTAB).  We 
conjecture  that  our  approach  generalizes  to  all  LTPMs  with  short  polynomial  recurrences, 
and  that  the  flexibility  to  choose  a  polynomial  basis  in  Akx  could  accelerate  the  computation 
of  the  left  polynomials,  when  the  recurrence  coefficients  are  known  in  advance. 

We  also  note  that  BICGSTAB (£)  is  a  promising  direction  for  future  work,  especially  once 
combined  with  the  communication-avoiding  tall-skinny  QR  kernel,  as  described  in  [20]. 

We  present  the  version  of  BICGSTAB  from  [33]  in  Alg.  11.  Note  that  the  vector  iterates 
sm  are  unrelated  to  the  scalar  s;  no  ambiguity  will  arise  since  the  former  is  always  subscripted 
while  the  latter  is  never. 
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Algorithm  10  Communication-avoiding  BIOMINS 
Require:  Initial  approximation  x0  for  solving  Ax  =  b. 

1:  Compute  sq  =  po  =  ro  =  b  —  Ax o- 

2:  Choose  f0  arbitrary  such  that  <50  =  r, f r0  p  0  and  S'0  =  r^Ap0  ^  0. 

3:  Set  p-\  =  0  and  s_i  =  0„)S. 

4:  repeat  for  m  —  0, 1, . . 

5:  Compute  three  2s-step  Krylov  bases  using  the  matrix  powers  kernel  (Akx): 

6:  P,R,S  ,B  =  akx  (A,  2s,  \pm,rm,sm_i\). 

7:  Compute  g  =  r, f  P,R,S  . 

8:  Initialize  coefficient  vectors 

9:  [a0,  CIq,  ,  Oq]  ^  , 

_(-Z4s+2,2s+l_ 

02s+2,2s+l 

10:  [b%,bl,;>,.,bs0]=  B  ,  and 

O Oo-LO  Oc_|_1 


2s+2,2s+lJ 

r„0  ,.l  1  _  04s+2)2s+l 

|_L-1>  c-1>  '  '  '  >  c-lJ  —  g 


,  and 


C°  _  ; 


for  j  =  0  :  s  —  1,  do 

b'm+j  —  9a)-  If  Ki+j  =  O’  declare  pivot  breakdown  and  STOP. 
Take  steps  along  search  direction: 

^ rn+j  =  dm+j /dm+j. 

c)  =  +  VWj-iCj-!  -  for  A:  =  0  :  2  (5  -  j  -  1). 


bj  2ujm+jb-  2ujrnjrj'iprn-\-j—\c, 


m+j— 1 


,  .2 

u'm+:?  06  ra  5 


for  k 


r  6°i  rc° 

Update  candidate  solution:  ej+i  =  ej  +  2cum  J  +  2umprn-i  +  tu. 

Set  Sm+j+ 1  =  If  5m+j+ 1  =  0,  STOP. 

If  P,R,S  b®+1  —  0n>i,  terminate  with  xex  —  P,R,S,xm  ej+\. 
Otherwise,  declare  Lanczos  breakdown. 

Update  search  direction: 

Pm+j  dm+j+l/dm+j- 

ctj+1  =  bj+1  +  2 pm+jCkj  +  p2m+jam+j,  for  k  =  0  :  2  (s  -  j  -  1). 

end  for 

Recover  iterates  from  last  inner  iteration. 

f m+s  =  P,R,S  6°,  pm+s  =  P,R,S  <Z°, 

Sm+s—i  P,R,S  cs_^,  and  xm4.s  P^R,  S,  xm  es. 
until  ||rm+s||  is  small. 

Terminate  with  approximate  solution  x  ~  xrn+s . 
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Algorithm  11  BICGSTAB _ 

Require:  Initial  approximation  xq  for  solving  Ax  =  b. 
1:  Compute  p0  =  r0  =  (b  —  Axq). 

2:  Choose  f0  arbitrary. 

3:  repeat  for  m  —  0, 1, ... 

4:  am  =  (f  jf  rm)  /  (f^Apm)  . 

5:  sm  —  v  m  amApm 

6:  cum  ( smAsm )  /  ^(Asm)  . 

7:  Xmjr\  Xm  +  C^rnPm  “I- 

8-  V  m-\-i  —  5m  (jJrnAsrn. 

9:  /5m  =  (r^rTO+1)  /  (r^rm)  x  (am/um). 

10:  Pm+1  =  rm+i  +  /5m  (pm  -  umApm). 

11:  until  ||rm+i||  is  small. 

12:  Terminate  with  approximate  solution  x  ~  xm+i. 


3.4.1  CA-BICGSTAB 


First,  we  note  that  the  vector  iterates  sm  are  auxiliary  quantities  introduced  to  simplify  the 
underlying  coupled  two-term  recurrences 

rm+ i  =  (/  -  umA)  (r m  -  amApm ) 

Pm+l  =  f m+ 1  T  /3m  (I  hJ m,A )  pm 

and  we  do  not  need  them.  We  substitute  the  definition  sm  =  rm  —  amApm  to  recover  these 
recurrences,  and  this  also  gives  us  new  expressions  for 


't'rn Al"rn  ^m^tn  A~pm  OlmpmA  A Tm  T  OtmOirnprnA  A  p.m 
rmAHArm  -  amr%AHA2pm  -  a^p%  ( AH )2  Arm  +  a^am  ( AH )2  A2pm 
•Em  T  Q?mPm  T  ^m  (j" m  Aprn  ) 


Since  BICGSTAB  avoids  multiplications  by  AH,  while  applying  A  twice  per  iteration,  the 
underlying  Krylov  space  is  extended  by  two  dimensions  per  iteration,  as  opposed  to  BICG. 
Thus,  in  order  to  take  s  steps  of  BICGSTAB  in  a  communication-avoiding  manner,  we  need 
to  explicitly  extend  the  Krylov  spaces  by  2s  dimensions  with  Akx.  Also,  we  compute  inner 
products  involving  the  shadow  residual  vector  r0  as  well  as  iterates  {r,p}.  We  will  replace 
the  former  using  a  “Gram  vector”  and  the  latter  with  a  “Gram  matrix:” 


g  =  ro  [A  R 
G  =  \p,r]H  \p,R 


Our  call  to  Akx  takes  the  two  source  vectors  \pm,rm]\  and  produce  the  2s-step  bases  [P,  H], 
similar  to  CA-BICG.  Lastly  we  introduce  the  coefficient  vectors 
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and  use  these  vectors  as  well  as  G  and  g  to  reach  the  communication-avoiding  version,  in 
Alg.  12. 

A  more  costly  version  of  CA-BICGSTAB  can  be  derived  using  the  auxiliary  vector  iterates 
sm  in  11.  In  this  case,  we  call  Akx  with  three  source  vectors  to  compute  three  s-step  bases, 
as  well  as  introduce  a  third  coefficient  vector.  We  do  not  present  this  algorithm  here,  but  it 
can  be  easily  derived  like  Alg.  12.  When  we  performed  our  numerical  experiments,  we  used 
this  three-iterate  version  in  order  to  compare  with  MATLAB  BICGSTAB  (which  also  uses 
three  iterates).  It  is  future  work  to  see  if  and  how  the  floating-point  properties  of  the  two- 
iterate  version  (Alg.  12)  differ  from  this  three-iterate  version.  We  note  that  the  three-iterate 
version  does  more  flops  than  Alg.  12,  and  so  conjecture  that  Alg.  12  will  have  no  worse 
round  ofF.  That  is,  we  conjecture  our  numerical  experiments  will  produce  similar  (if  not 
better)  results  when  we  implement  Alg.  12. 

3.5  Preconditioning 

Preconditioning  is  a  technique  frequently  used  in  practice  to  accelerate  the  convergence  of 
KSMs  by  replacing  Ax  =  b  by  an  equivalent  linear  system  Ay  =  c,  where  A  =  M^AM^1, 
y  =  Mrx.  and  c  =  M^b.  We  call  ML  and  MR  left  and  right  preconditioners,  respectively; 
these  preconditioners  are  chosen  so  that  A  has  a  smaller  condition  number  than  A.  When  A 
and  A  are  normal,  this  lowers  theoretic  upper  bounds  on  the  worst-case  rate  of  convergence 
for  Krylov  methods.  Selecting  an  appropriate  preconditioner  is  a  wide  field  of  study  itself, 
which  we  do  not  discuss  further  here. 

Our  concern  is  primarily  the  extension  of  our  methods  to  the  preconditioned  case,  such 
that  communication  is  still  avoided.  Foremost,  we  require  that  the  preconditioned  matrix 
A  can  be  well-partitioned  or  have  a  special  structure  we  can  exploit  -  otherwise,  we  cannot 
avoid  communication  with  the  matrix  powers  kernel.  However,  even  if  communication  can 
not  be  avoided  by  use  of  the  matrix  powers  kernel,  we  could  compute  the  s  SpMVs  in  a 
straightforward  way  in  the  outer  loop  and  still  save  in  communication  costs  by  blocking  the 
dot  products  in  the  inner  loop,  as  discussed  in  [37].  When  A  can  be  well-partitioned  (and 
is  explicitly  available),  our  usual  communication-avoiding  matrix  powers  kernel  can  still  be 
used.  The  simplest  case  for  which  this  holds  is  (block)  diagonal  preconditioning,  used  in 
many  scientific  applications.  A  Krylov  subspace  for  A  can  be  computed  using  the  same 
dependencies  required  for  A  ,  as  application  of  the  diagonal  preconditioner  matrix  requires 
only  local  work.  If  the  matrix  is  dense,  but  has  a  special  structure  we  can  exploit  (i.e. , 
the  low-rank  structure  in  Hierarchical  Semiseparable  (HSS)  Matrices),  we  can  still  avoid 
communication  with  a  different  variant  of  the  matrix  powers  kernel,  which  will  be  discussed 
in  future  work. 

The  preconditioned  variants  of  our  CA-KSMs  require  a  few  additional  changes  from  the 
unpreconditioned  versions.  In  the  case  of  left-  or  split-preconditioning,  our  algorithms  will 
report  the  preconditioned  residuals,  zm  ,  instead  of  the  unpreconditioned  residuals,  rm.  This 
means  that  we  must  compute  ||rm||2  =  ||M-Wrn||2  in  order  to  check  convergence,  where  M 
is  the  left  preconditioner,  or  left  part  of  the  split  preconditioner.  The  communication  cost 
incurred  by  the  preconditioner  solve  is  a  lower  order  term  (compared  to  the  preconditioned 
matrix  powers  kernel  invocation),  assuming  M~l  is  well-partitioned. 

Others  have  discussed  the  use  of  polynomial  preconditioners  [32],  Polynomial  precondi- 
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Algorithm  12  Communication-avoiding  BICGSTAB 
Require:  Initial  approximation  xo  for  solving  Ax  =  b. 

1:  Compute  p0  =  r0  —  (b  —  Axp). 

2:  Choose  f0  arbitrary. 

3:  repeat  for  m  —  0, 1, . . 

4:  Extend  the  two  Krylov  bases  by  2s  dimensions,  using  the  matrix  powers  kernel  (Akx). 

5:  P,R  ,B  =  Akx  (A,  2s,  \pm,rm]). 

6:  Compute  g  —  Tq  P,R  . 

r  -  i-,/?  rA 

7:  Compute  G  =  P,R  P,R  . 


Initialize  coefficient  vectors 

\h°  h 1  hs]  —  ^s+1>5+1  L0  n1  „sl  _  _  02s+2,l 

ro?  uoi  ■  ■  ■  ■>  uo\  —  r  i  Luo>  uo>  •  •  •  5  uoJ  —  n  •  eo  —  i 

-D  Us+1,8+1  1 


for  j  —  0  :  s  —  1,  do 

am+j  =  gb°j/ga). 

_  Ga?-am+j(a±')H Gb±+am+jO!m+j(a±')H Ga? 

m+ ^  (bj)J/Gfej-Qm+j(fej)HGa|-am+j(a|)HGfe]+am+jam+j(a|)HGa2 

Update  candidate  solution: 


Cj+1  “I-  ^m-pj  Q  ^m-pj 

Update  residual: 

6*Li  =  b\ •  -  am+Ja^+1  -  ow* 


^m-pj^m-pj  q 


o;m+i6j+1  +  am+jujm+ja^+2 ,  for  k  =  0  :  2(s  -  j  -  1). 


Uj+ 1  —  —  —  ^m+jVj  ~r  ,  wi  fv  —  u  .  . 

Update  search  direction: 

ftm+j  =  (gbj+1/gbj)  x  (o:m-|-j/cuTO_|_j). 

aj+i  =  -  /Im+ia;m+ia)'+1,  for  k  =  0  :  2(s  -  j  -  1). 

end  for 

Recover  iterates  from  last  inner  iteration. 


T'm-ps  P)  P  Pm-\ 

%m-ps  P )  P)  %m  e8. 


Pm+s 


P,R  a°s, 


until  ||rm+s||  is  small. 

Terminate  with  approximate  solution  x  ~  xrn+s . 
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tioning  could  be  easily  incorporated  in  our  methods,  as  the  general  implementation  of  the 
matrix  powers  kernel  computes  polynomials  of  A.  Hoemmen  et  al.  have  also  devised  an  al¬ 
gorithm  to  avoid  communication  in  the  case  of  semiseparable,  hierarchical,  and  hierarchical 
semiseparable  matrices  (matrices  with  low-rank  off-diagonal  blocks)  [20,  12].  Such  matri¬ 
ces  arise  when,  for  example,  A  is  tridiagonal:  the  inverse  of  a  tridiagonal,  although  dense, 
has  the  property  that  any  submatrix  strictly  above  or  strictly  below  the  diagonal  has  rank 
1.  Hoemmen  exploits  such  low-rank  off-diagonal  blocks  in  order  to  avoid  communication 
when  applying  A,  although  he  notes  that  his  algorithm  introduces  significantly  additional 
computational  requirements.  In  future  work,  we  will  investigate  Hoemmen’s  approach  and 
determine  its  practicality,  as  well  as  investigate  communication-avoiding  approaches  for  other 
classes  of  preconditioners. 


4  Convergence 

4.1  Choice  of  Basis 

As  discussed  in  [12,  27],  the  matrix  powers  kernel  extends  the  Krylov  spaces  using  polyno¬ 
mials  p  which  may  not  be  simple  monomials.  For  simplicity,  we  only  consider  polynomials 
that  may  be  computed  by  a  three-term  recurrence.  Given  a  starting  vector  v,  we  compute  a 
sequence  of  vectors 


7U 

3  =  0 

oi\Avo  +  Vo 

3  =  1 

( atjA  +  Pjl)  vj- 1  +  7Wi-2 

3>  1 

Note  that  ay-  and  j3j  are  defined  when  3>  1  and  t j  is  defined  when  :i  >  2.  The  first  vector 
is  scaled  by  some  scalar  7.  We  rearrange  terms  to  uncover  the  identity 


Avi  = 


—  —  Vo  +  —V\ 
u  a\  ^ 

7j+ 1  _  _  Pj±l 

aj+ 1 


n _ 1  0-7 

&j+i  J  x  a. 7_i-i  4 


3  =  0 

1  •  ^ 
Vi 


or  written  in  matrix  form, 


A  [  v0 


=  [  v0  ■■■  vj+1 


aj+ 1 

V3+ 1 

72 

ai 

OL2 

J_ 

—  Pi 

ot\ 

OL2 

OL  2 

\ 


7j+i 

“j+l 

P3+1 

"j+l 

ai+ 1 


A  Vi  =  Vi+iB 


j+i&j+i 


Now  we  derive  a  change-of-basis  matrix  Bj  such  that  we  transform  from  our  polynomial 
basis  to  the  Krylov  basis: 


VjBj  =  [  v0 


}Bj=[ 


Ah  ]  =  Kj 


26 


where  Kt  is  a  Krylov  matrix.  The  first  column  of  Bj  is  obvious:  Bj  (:,  1)  =  ^ei-  Now  we 
observe  that,  when  1  <  k  <  j 


VjBj  (:,  k) 
AVjBj^k) 


Vj+iBj+iBj  (:,k) 


Kj  (:,k) 

AKj  (:,  k) 

Kj  (-.,k  +  l) 
VjBj  (:,  k  +  1) 
VjBj  (:,  k  +  1) 


Now,  we  note  that  Bj  has  a  nonzero  diagonal  and  is  upper  triangular  in  general,  since 
span  (Vi)  =  span  (A';)  Vi  >  0,  based  on  the  assumption  that  the  polynomials  pl  are  of 
degree  i  exactly.  This  means  that  the  column  vector  Bj  (:,  k)  is  zero  in  rows  (k  +  1  :  j  T  1). 

"  4+i  (1  :  k  +  1, 1  :  A;) 


Furthermore,  Bj+i  is  upper  Hessenberg,  so  BJ+1  (:,  1  :  k)  = 
will  exploit  these  zeros  by  splitting  the  matrices  blockwise: 

Vj+iBj+iBj  (:,  k) 

—  [  Vj+ 1  0, 1  :  k  +  1)  Vj+ 1  (:,  k  +  2  :  j  +  2)  ] 

x 


0,-_ 


'j—k+l,k 


We 


4+i  (1  :  k  +  1, 1  :  k) 

B  (1  :  k  +  1,  k  +  1  :  j  +  1) 

Bj  (1  :  /c,  /c) 

Oj-fc+i,j 

4+i  (k  +  2  :  j  +  2,  k  +  1  :  j  +  1) 

O7— fc+1,1 

—  Vj+ 1  (:,  1  :  k  +  1)  Vj+1  (:,  k  +  2  :  j  +  2) 


4+i  (1  •  k  +  1, 1  :  k) 

0j-/c+ij 

=  Vj+ 1  (:,  1  :  k  +  1)  4+i  (1  :  k  +  1, 1  :  k)  Bj  (1  :  k,  k) 

=  VkBj  (1  :  k  +  1, 1  :  k)  Bj  (1  :  k,  k ) 

For  the  right  hand  side,  we  have 

Bj(l  :  k  +  1,  k  +  1) 


Bj  (1  :  k,  k ) 


VjBj  (:,  k  +  1)  —  [  Vj  (:,  1  :  k  +  1)  Vj  (:,  k  +  2  :  j  +  1)  ] 

=  V}  (:,  1  :  k  +  1)  Bj  (1  :  k  +  1,  k  +  1) 

=  VkBj  (1  :  k  +  1,  k  +  1) 


Oj— k,l 


so  combining, 

VkBj  (1  :  k  +  1,  k  +  1)  =  14 4  (1  :  k  +  1, 1  :  k)  Bj  (1  :  k,  k ) 

-Bj  (1  :  k  +  1,  k  +  1)  =  Bj  (1  :  k  +  1, 1  :  k )  -Bj  (1  :  k ,  £;) 

This  last  formula  can  be  used  to  construct  the  upper  triangular  ( j  +  1)  x  (j  +  1)  change-of- 
basis  matrix  Bj  for  any  three-term  polynomial  recurrence,  starting  with  Bj  (1, 1)  =  V,  given 

the  ( j  +  1)  x  j  upper  Hessenberg  matrix  Bj.  For  an  s-step  basis,  j  =  s. 


4.1.1  Monomial  Basis 

The  simplest  basis  that  we  can  use  in  our  communication-avoiding  methods  is  the  monomial 
basis,  [x,  Ax ,  A2x , ...]  . 
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In  the  case  of  the  scaled  monomial  basis,  with  scaling  factors  (crj)sj=0  ,  we  have  7  =  A 
and  ay  leading  to 

0 

<7 1  0 

<72  0 

<7.3 

Note  that  for  the  unsealed  monomial  basis,  Bs  =  [0 ;  I]  and  Bs  =  I. 

However,  it  is  well-known  that  the  monomial  basis  converges  to  the  principle  eigenvector 
of  A,  a  property  that  is  exploited  in  the  Power  Method.  Therefore,  in  finite-precision  arith¬ 
metic,  monomial  basis  vectors  are  subject  to  becoming  linearly  dependent  if  we  choose  s  too 
large.  As  was  previously  observed  by  [9,  31],  using  a  rank-deficient  basis  leads  to  convergence 
failure,  as  the  Krylov  subspace  can  no  longer  expand.  To  remedy  this  problem,  one  must  use 
a  more  stable  basis.  Two  options  that  we  consider  are  the  Newton  basis  and  the  Chebyshev 
basis. 

4.1.2  Newton  Basis 

The  Newton  basis  can  be  written 

V  =  [. x ,  ( A  -  8J)x ,  (A  -  82)(A  -  9i)x,  ...,  (A  -  6S)(A  -  0s_i)...(A  -  81)x\ 

Here,  the  shifts,  [6b,  ...,0S],  are  taken  to  be  Leja  points  computed  using  eigenvalue  esti¬ 
mates  for  A,  as  described  in  [29].  For  the  purpose  of  convergence  results  in  this  paper,  we 
use  the  knowledge  of  the  full  spectrum  of  A  to  compute  s  Leja  points,  rather  than  eigenvalue 
estimates.  In  practice,  if  the  user  does  not  have  any  information  about  the  spectrum  of  A, 
our  method  can  be  adapted  to  perform  O(s)  steps  of  Arnoldi  and  find  the  eigenvalues  of 
the  resulting  upper  Hessenberg  matrix  H ,  giving  us  eigenvalue  estimates  for  A  (Ritz  values). 
The  Leja  ordering  routine  is  then  used  to  select  the  s  “best”  shifts  to  use.  The  “best”  shifts 
are  defined  as  being  largest  in  magnitude/furthest  away  from  each  other  (if  two  shifts  are 
close  together,  the  resulting  basis  vectors  can  again  become  linearly  dependent).  We  can 
write  the  formula  for  selecting  s  shifts  as  follows 

81  =  argmaxzeXb|2| 

3 

8j+ 1  =  argmaxzeK.  Y[\z  -  zk\ 

k= 0 

where  K}  =  {9j+1,  9j+2,  ...,  9S} 

Using  the  same  notation  as  for  the  monomial  basis,  for  the  scaled  Newton  basis,  with 
scaling  factors  ( rr; ) )  (|  and  shifts  {8j)s]=x  ,  we  have  7  =  A  anc|  which  gives 
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We  can  also  construct  the  upper  triangular  Bs  change  of  basis  matrix  without  construct¬ 
ing  Bs  (although  the  operations  to  form  Bs  are  analogous).  Start  with  Bs  =  Is,  and  build 
B  according  to  the  following  recurrence: 


/  Ox 

Vl 


Bs  = 


B(i,  j)  =  B(i  -  1,  j  -  1)  +  0i  ■  B(i,  j  -  1) 

For  example,  let’s  construct  Bs  for  s  —  4,  with  eigenvalues  estimates  [0\,  02,  03,  04],  Bs 
is  an  s  +  1  x  s  +  1  matrix.  Using  the  recurrence  above,  for  s  —  4, 


B, 


1  0i 
0  1 
0  0 
0  0 
0  0 


0\  0\  ej 

01  +  02  0\+  (01  +  02)02  0'1  +  (0\  +  (01  +  02)02)\02 
1  01  +  02  +  03  0\  +  (01  +  02)02  +  (01  +  02  +  03)03 

0  1  01  +  02  +  03  +  04 

0  0  1 


If  we  have  a  real  matrix  with  complex  eigenvalues  and  wish  to  avoid  complex  arithmetic 
(for  performance  reasons),  we  use  instead  use  the  modified  Leja  ordering  [1],  which  places 
complex  conjugate  pairs  adjacently  so  the  complex  values  disappear  from  the  computed 
basis  vectors.  The  matrix  powers  kernel  for  the  modified  Leja  ordering  is  then  used,  which 
is  described  in  [20].  For  consistency,  if  (0j-i ,0j)  form  a  complex  conjugate  pair,  we  require 
that  0j- 1  has  the  positive  imaginary  component.  We  have  7  =  y-  and 
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0j  is  part  of  a  complex  conjugate  pair 


(0j-i,0j)  are  a  complex  conjugate  pair 
otherwise 


Construction  of  the  change  of  basis  matrix  is  analogous  for  the  modified  Leja  ordering. 
Effects  of  the  modified  Leja  ordering  on  stability  have  not  been  thoroughly  investigated  in 
the  literature. 

The  construction  of  the  Newton  basis  can  result  in  underflow  or  overflow  in  finite  precision 
arithmetic,  if  the  chosen  shifts  are  too  close  together  or  too  far  apart,  respectively.  Reichel 
[30]  solves  this  problem  by  using  an  estimate  for  the  capacity  of  a  subset  of  the  complex 
plane,  defined  for  the  set  of  shifts  as 
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k- 1 

ck  =  Yl\0k-0j\1/k 

3= 1 

The  shifts  are  divided  by  this  quantity  before  an  ordering  is  chosen.  The  convergence 
results  shown  in  this  paper  do  not  use  this  capacity  estimate,  as  we  expect  that  the  matrix 
equilibration  routine  will  sufficiently  bound  the  radius  of  the  spectrum  of  A  -  implementation 
and  analysis  is  considered  future  work. 

4.1.3  Chebyshev  Basis 

Chebvshev  polynomials  are  another  attractive  choice  for  the  construction  of  the  Krylov  basis. 
Chebyshev  polynomials  have  the  property  that  over  all  real  polynomials  of  a  degree  m  on 
a  specified  real  interval  /,  the  Chebyshev  polynomial  of  degree  m  minimizes  the  maximum 
absolute  value  on  /.  This  unique  property  allows  for  optimal  reduction  of  error  in  each 
iteration,  making  many  improvements  in  IvSM  convergence  and  analysis  possible. 

•Joubert  and  Carey  [23]  present  the  recurrence  for  the  scaled  and  shifted  Chebyshev 
polynomials  as 

Po(z)  =  1 

P l(")  =  Tg{Z~C) 

Pk+i{z)  =  -  (z  -  c)Pk(z)  -  ^-Pfc_i(z) 

9  L  J 

where  the  coefficients  g ,  c  and  d  serve  to  scale  and  shift  the  spectrum  of  A  to  the  unit 
circle  (Note  that  this  only  works  for  real  matrices,  whose  center  lies  on  the  real  axis).  Using 
the  spectrum  of  A,  we  select  these  parameters  such  that  the  focus  of  the  ellipse  are  at  c±d, 
and  g  =  max(|Aj|).  Here,  we  have  7  =  1  and 


Pi 

7 j 

This  gives  us  the  matrix 
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We  can  equivalently  derive  a  formula  to  construct  the  change  of  basis  matrix  Bs  : 


£,(1,1)  =  1 

2 gB{i  -  1,  j  -  1)  +  cB(i,j  -  1)  +  fgB{i  +  1,  j  -  1) 
gB{i  -  1,  j  -  1)  +  cB(i,j  -  1)  +  fgB(i  +  1,  j  -  1) 

For  example,  for  s  —  4,  we  get  the  change  of  basis  matrix 


1  c 

c2  +  f 

c3  +  |  cd2 

c4  +  3  c2d2  +  |d4 

o  2  g 

2  eg 

\g(Ac2\d2) 

2g  (4c3  +  3  cd2) 

b4  = 

0 

2  g2 

6cg 2 

2g2(6c2  +  d2) 

0 

2  g3 

8cg3 

0 

2g4 

Note  that  this  simplifies  to  the  upper-triangular  checkerboard  pattern  if  we  take  c  =  0, 
which  means  that  the  ellipse  enclosing  the  eigenvalues  of  A  is  already  centered  around  0  on 
the  real  axis. 

4.1.4  Basis  Scaling  vs.  Matrix  Equilibration 

As  discussed  in  [20],  successively  scaling  (e.g.,  normalizing)  the  basis  vectors  reintroduces  the 
global  communication  requirement  between  SpMV  operations.  Even  if  we  wait  until  the  end 
of  the  computation  and  perform  the  scaling  of  all  basis  vectors  in  one  step,  this  still  requires 
an  all-to-all  communication  in  each  outer  loop  iteration  (or  reading  and  writing  0(n)  words  to 
and  from  slow  memory).  Although  this  does  not  asymptotically  increase  the  communication 
costs,  it  does  increase  the  constant  factors,  and  will  thus  decrease  performance  (and  can  also 
fail  to  improve  stability).  In  order  to  avoid  this  extra  cost,  we  perform  matrix  equilibration 
as  a  preprocessing  step  before  running  the  CA-KSM.  Matrix  equilibration  is  a  common 
technique  to  improve  the  spectrum  of  ill-conditioned  matrices.  Our  results,  as  well  as  results 
from  [20],  indicate  that  this  technique  is  often  good  enough  for  reducing  the  condition  number 
of  the  basis,  and  thus  we  do  not  use  the  scaled  versions  of  the  bases  described  above  (except 
in  the  case  of  Chebyshev,  where  “scaling”  is  constant  throughout  the  execution  and  refers 
to  a  scaling  of  the  spectrum  of  A  rather  than  normalization  of  the  basis  vectors  in  every 
iteration). 


i  =  2 

otherwise 
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4.2  Convergence  Results 

In  this  section,  we  present  convergence  results  which  demonstrate  that  our  Communication- 
Avoiding  methods  can  maintain  stability  and  convergence  similar  to  the  standard  implemen¬ 
tation  for  many  test  matrices.  We  present  convergence  results  for  the  two-term  variants 
of  CA-BICG  and  CA-BICGSTAB,  as  the  BIGG  method  is  the  easiest  to  analyze,  and  the 
BICGSTAB  method  is  most  used  in  practice.  All  experiments  were  run  with  [1,  1,  ...,  1]  as 
the  RHS  and  =  0  as  the  initial  guess,  which  results  in  a  starting  residual  r0  =  6,  with 

IM|2  =  ll&lla  =  y/n. 

4.2.1  Diagonal  Matrices 

We  initially  show  results  for  diagonal  matrices  to  demonstrate  some  basic  convergence  trends. 
For  diagonal  matrices,  the  only  significant  round  off  errors  in  our  algorithm  stem  from  the 
vector  and  scalar  operations  (rather  than  in  the  matrix  powers  kernel),  making  them  useful 
cases  for  analysis.  The  diagonal  matrices  were  created  with  equally  spaced  eigenvalues  on 
an  interval  chosen  to  achieve  a  given  condition  number.  We  first  explore  how  convergence 
changes  for  all  three  bases  as  a  function  of  s,  the  basis  length,  and  as  a  function  of  the 
condition  number  of  A.  Figure  1  shows  results  for  one  matrix  with  low  condition  number 
(~  10),  with  s  varying,  and  Figure  2  shows  results  for  one  value  of  s,  with  the  condition 
number  of  A  varying  for  the  CA-BICG  method. 

These  simple  examples  demonstrate  two  basic  properties  of  our  CA-KSMs.  We  can  see 
that  if  we  choose  a  high  s  value,  the  basis  vectors  can  become  linearly  dependent,  resulting 
in  failure  to  converge.  This  happens  early  on  for  the  monomial  basis,  as  is  expected.  For 
s  =  10  (a  theoretical  10 x  speedup),  all  bases  follow  the  same  iterates  as  the  standard 
implementation.  Behavior  for  other  values  of  s  “interpolates”  the  data  shown,  i.e. ,  all  bases 
will  follow  the  same  iterates  as  the  standard  implementation  for  2  <  s  <  10.  The  Newton 
and  Chebyshev  bases  (or  change  of  basis  matrices)  eventually  become  ill-conditioned  as  well, 
but  are  able  to  reproduce  the  standard  iterates  using  a  larger  Krylov  subspace  than  the 
monomial  basis.  From  Figure  2,  it  is  clear  that,  like  the  standard  KSMs,  convergence  of  our 
CA-KSMs  is  (somewhat)  dependent  on  the  condition  number  of  A.  It  is  also  important  to 
notice,  in  Figure  1,  the  stagnation  that  occurs  in  all  three  bases  for  s  =  20.  This  behavior 
is  due  to  build  up  of  round-off  errors  in  the  computed  residual,  which  causes  significantly 
deviation  from  the  true  residual.  This  behavior  worsens  (stagnation  occurs  sooner)  with 
increasing  s.  We  will  discuss  remedies  for  this  in  Section  4.2.3. 
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Figure  1:  Convergence  of  Diagonal  Matrices  for  Various  Values  of  s.  N  =  1000,  cond  10. 
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Figure  2:  Convergence  of  Diagonal  Matrices  for  Various  Condition  Numbers,  (a)  10,  (b)  103, 
(c)  105.  N  =  1000.  Eigenvalues  are  uniformly  distributed. 

4.2.2  Results  for  Test  Matrices  from  Various  Applications 

We  have  performed  convergence  tests  for  both  CA-BICG  and  CA-BICGSTAB  for  the  a  small 
set  of  test  matrices  from  the  University  of  Florida  Sparse  Matrix  Collection  [11],  In  these 
tests,  the  matrix  equilibration  routine  was  used  (when  necessary).  As  above,  the  starting 
vector  (r0  =  b )  used  was  a  vector  of  all  one’s,  and  the  initial  guess  x0  =  0. 

mesh2el  The  matrix  mesh2el  is  a  discretized  structural  problem  from  NASA.  The  dimen¬ 
sion  of  the  matrix  is  306,  with  2,  018  nonzeros.  This  matrix  is  symmetric  positive  definite 
(SPD),  and  is  well-conditioned  (condition  number  ~  400).  Although  BICG  and  BICGSTAB 
work  for  nonsymmetric  matrices,  we  begin  with  this  example  as  the  convergence  behavior  is 
smoother  and  more  predictable  for  SPD  matrices. 

Figure  3  shows  the  eigenvalues  of  this  matrix  (blue  x’s)  along  with  the  first  30  Leja 
points  (black  o’s).  The  plot  on  the  right  shows  the  condition  number  of  the  monomial  basis, 
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Newton  basis  (computed  using  the  shown  Leja  points),  and  the  Chebyshev  basis  (where  the 
bounding  ellipse  was  calculated  using  the  exact  eigenvalues),  versus  the  size  of  the  basis. 


Figure  3:  Spectrum  and  Leja  Points  for  mesh2el  (left).  Basis  Condition  Number  vs.  Basis 
Length  (right). 
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Figure  4:  Convergence  of  CA-BICG  for  mesh2el. 
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Figure  5:  Convergence  of  CA-BICGSTAB  for  mesh2el. 


pde900  This  matrix  is  a  5-pt  central  difference  discretization  of  a  2D  variable-coefficient 
linear  elliptic  equation  on  the  unit  square  with  Dirichlet  boundary  conditions,  with  NX  = 
NY  =  30.  It  is  well-conditioned,  with  condition  number  ~  300.  Here,  N  =  900,  and  there  are 
4,  380  nonzeros.  This  matrix  is  structurally  symmetric,  with  50%  numeric  value  symmetry. 
This  matrix  is  not  positive  definite.  Figure  6  below  shows  the  eigenvalues  of  this  matrix, 
selected  Leja  points,  and  resulting  condition  number  of  the  basis  for  various  bases  and  basis 
lengths.  Figures  7  and  8  show  convergence  results  for  CA-BICG  and  CA-BICGSTAB  for 
the  different  bases,  for  increasing  values  of  s. 
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Figure  6:  Spectrum  and  Leja  Points  for  pde900  (left).  Basis  Condition  Number  vs.  Basis 
Length  (right). 


Figure  7:  Convergence  of  CA-BICG  for  pde900. 
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Figure  8:  Convergence  of  CA-BICGSTAB  for  pde900. 


fs_680_l  This  matrix  is  an  unsymmetric  facsimile  convergence  matrix.  It  is  slightly  less 
well-conditioned,  with  condition  number  ~  104.  After  the  matrix  equilibration  routine,  the 
condition  number  was  lowered  to  ~  103.  This  matrix  is  not  positive  definite,  and  is  not 
symmetric  in  structure  or  in  value.  Here,  N  =  680,  and  there  are  2, 184  nonzeros.  Figure 
9  below  shows  the  eigenvalues  of  this  matrix,  selected  Leja  points,  and  resulting  condition 
number  of  the  basis  for  various  bases  and  basis  lengths.  Figures  10  and  11  show  convergence 
results  for  CA-BICG  and  CA-BICGSTAB  for  the  different  bases,  for  increasing  values  of  s. 
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Figure  9:  Spectrum  and  Leja  Points  for  fs_680_l  (left).  Basis  Condition  Number  vs.  Basis 
Length  (right). 


CABiCG  Convergence,  s  =8 


Figure  10:  Convergence  of  CA-BICG  for  fs_680_l. 
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Figure  11:  Convergence  of  CA-BICGSTAB  for  fs_680_l. 


4.2.3  Effect  on  Maximum  Attainable  Accuracy 

In  the  previous  section,  we  tested  for  convergence  of  the  true  residual  to  a  precision  of  ICC6 
(i.e. ,  ||rm||2  <  1CT6  ).  In  practice,  however,  higher  precision  may  be  necessary.  We  see  from 
the  convergence  plots  in  the  previous  section  that,  for  high  s  values,  for  both  CA-BICG 
and  CA-BICGSTAB,  our  CA  method  is  unable  to  converge  to  the  same  tolerance  as  the 
standard  implementations  because  the  convergence  stagnates.  This  effect  is  due  to  floating 
point  round  off  error,  which  causes  a  discrepancy  between  the  true  and  computed  residuals 
(similar  plots  of  the  computed  residual  would  appear  to  continue  convergence).  This  problem 
has  been  observed  before,  and  is  known  to  especially  plague  2-sided  KSMs  [17].  We  observe 
here  that  this  problem  is  also  present  in  the  CA  variants  of  these  methods,  and  is  in  fact 
exacerbated  the  larger  the  s  value  chosen  (this  behavior  is  expected  to  worsen  in  the  3-term 
variants  of  the  method  [17]).  As  restarting  has  been  used  effectively  to  combat  this  problem 
in  the  BICG  method  [34],  we  extend  this  technique  to  our  CA-KSMs.  In  the  remainder 
of  this  section,  we  describe  the  problem  of  round  off  in  our  CA-KSMs,  and  present  results 
which  indicate  that  restarting  prevents  buildup  of  round  off  error  and  allows  for  convergence 
to  a  higher  precision. 
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This  decrease  in  maximum  attainable  accuracy  is  due  to  using  a  basis  matrix  (or  change  of 
basis  matrix  with  a  high  condition  number),  which  is  (more  than)  squared  in  the  construction 
of  the  Gram-like  matrix.  The  presence  of  very  large  (and  very  small)  values  in  the  Gram¬ 
like  matrix  leads  to  floating  point  round  off  errors  when  the  recurrence  coefficients  are  used 
to  recover  the  monomial  basis.  In  the  plots  for  unrestarted  CA-BICGSTAB,  we  see  that 
the  monomial  basis  can  sometimes  achieve  a  higher  maximum  attainable  accuracy  than 
the  Chebyshev  or  Newton  bases  (although  it  deviates  from  the  standard  algorithm  iterates 
sooner  in  the  execution).  This  is  due  to  increased  round  off  error  in  the  construction  of  and 
multiplication  by  Bs ,  the  change  of  basis  matrix,  in  both  Chebyshev  and  Newton  bases. 

If  the  spectrum  of  A  is  large,  consecutive  shifts  used  in  the  Newton  basis  may  differ  by 
orders  of  magnitude,  and  ellipse  parameters  chosen  for  the  Chebyshev  basis  may  be  large, 
leading  to  an  ill-conditioned  Bs.  By  choosing  parameters  such  that  the  condition  number 
of  the  basis  matrix  is  as  small  as  possible,  we  have  also  chosen  parameters  which  will  make 
the  condition  number  of  the  change  of  basis  matrix  high.  We  expect  results  for  the  Newton 
basis  may  be  somewhat  improved  by  using  Reichel’s  capacity  estimate  in  computing  the 
(modified)  Leja,  ordering  [30].  Further  experiments  are  required. 

For  the  monomial  basis,  however,  Bs  =  Is+ 1,  so  the  change  of  basis  matrix  is  always 
well-conditioned  irrespective  of  the  spectrum  of  A.  Here,  we  have  created  a  change  of  basis 
matrix  with  very  low  condition  number,  which  results  in  a  basis  matrix  with  a  high  condition 
number.  This  effect  is  more  prevalent  in  CA-BICGSTAB  than  in  CA-BICG  due  to  the 
use  of  three  (rather  than  two)  Krylov  bases  in  the  construction  of  the  Gram-like  matrix, 
which  leads  to  a  significantly  higher  condition  number.  We  expect  that  using  the  variant  of 
CA-BICGSTAB  which  only  requires  two  RHSs  for  the  matrix  powers  kernel  would  lead  to 
behavior  closer  to  that  of  CA-BICG. 

Restarting  has  been  used  as  a  way  to  combat  this  problem  in  the  BICG  method,  with 
good  results  [34],  In  a  similar  attempt,  we  employ  restarting  of  our  CA-KSMs.  Our  results 
are  shown  in  Figures  12  and  13,  which  shows  both  the  unrestarted  and  restarted  methods. 
To  restart,  we  take  the  current  x  value  and  restart  the  algorithm  using  this  x  as  the  ini¬ 
tial  guess,  which  prevents  the  buildup  of  floating  point  errors.  Many  techniques  exist  for 
dynamically  choosing  when  the  method  should  be  restarted,  e.g.,  [34],  by  computing  the 
true  residual  b  —  Ax  after  some  number  of  iterations  and  monitoring  the  discrepancy.  In 
our  experiments,  we  manually  experimented  to  find  a  good  restart  value  for  each  test  case. 
Implementing  a  dynamically  selected  restart  length  (in  a  communication-avoiding  way)  is 
future  work.  It  should  be  noted  that  when  the  algorithm  is  restarted,  there  is  an  expected 
plateau  before  convergence  is  seen  again,  as  the  previous  information  we  had  is  lost.  This 
can  be  combated  by  using  deflation  techniques  which  require  knowledge  of  eigenvectors  for 
the  smallest  eigenvalues.  This  is,  again,  considered  future  work. 
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Figure  12:  CA-BICG  (top)  and  CA-BICGSTAB  Convergence  (bottom)  for  mesh2el  Matrix, 
for  s  =  20.  s  x  t  is  the  chosen  restart  length.  Plots  on  left  show  convergence  without 
restarting,  plots  on  right  show  convergence  with  restarting  technique. 
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Figure  13:  CA-BICG  (top)  and  CA-BICGSTAB  Convergence  (bottom)  for  fs_680_l  Matrix, 
for  s  =  20.  s  x  t  is  the  chosen  restart  length.  Plots  on  left  show  convergence  without 
restarting,  plots  on  right  show  convergence  with  restarting  technique. 

4.2.4  Further  Techniques  for  Improving  Convergence:  TSQR 

Another  technique  for  improving  convergence  involves  the  use  of  TSQR  [12].  This  kernel 
has  two  uses  in  our  algorithms.  First,  we  could  use  the  output  of  this  kernel  to  construct 
an  orthogonal  basis  for  the  Krylov  Subspace.  To  perform  this  orthogonalization  technique, 
we  perform  a  TSQR  operation  on  the  basis  vectors  after  they  are  generated  by  the  matrix 
powers  kernel,  once  per  outer-loop  iteration.  As  this  is  a  communication-avoiding  kernel,  we 
do  not  asymptotically  increase  the  communication  cost,  although  the  constants  grow  larger, 
and  the  computation  cost  increases.  We  then  construct  the  Q  basis  matrix  from  the  output 
of  the  kernel,  and  instead  use  this  orthogonalized  basis  in  the  s  inner  loop  iterations.  We 
can  easily  update  the  change  of  basis  matrix  by  pre-multiplication  by  the  R  factor. 

It  should  be  noted  that  this  approach  does  not  reintroduce  dependencies  between  inner 
and  outer  loops  in  the  Krylov  Subspace  Method,  as  occurs  when  computing  scaled  bases, 
mentioned  in  Section  2.  In  work  describing  computing  a  scaled  basis,  each  basis  vector 
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is  orthogonalized  against  all  previous  basis  vectors  as  they  are  computed.  In  using  the 
TSQR  kernel  for  orthogonalization,  we  first  compute  all  basis  vectors,  and  then  perform  the 
orthogonalization.  Although  this  approach  can  result  in  more  round-off  error,  it  allows  us 
to  preserve  our  communication-avoiding  strategy. 

Our  initial  results  have  indicated  that  orthogonalization  helps  convergence  only  minimally 
in  the  case  of  CA-BICGSTAB,  and  also  only  minimally  for  the  Newton  and  Chebyshev  bases 
in  the  CA-BICG  method.  Orthogonalization  also  does  not  prevent  the  stagnation  due  to 
round  off  error.  The  restarting  technique  described  in  the  previous  section  is  still  necessary. 
Restarting  alone,  however,  does  not  allow  for  convergence  when  using  the  monomial  basis  in 
the  CA-BICG  method.  In  this  case,  both  orthogonalization  and  restarting  are  needed,  and 
when  both  these  techniques  are  used,  the  monomial  basis  (which  can  really  no  longer  be  called 
the  “monomial  basis”)  behaves  just  as  well  as  the  Newton  and  Chebyshev  bases.  Because 
the  orthogonalization  through  the  TSQR  operation  requires  another  communication  in  each 
outer  loop,  we  recommend  that  this  technique  only  be  used  in  the  CA-BICG  algorithm 
with  the  monomial  basis.  The  monomial  basis  might  be  an  attractive  choice  if  the  user 
does  not  have  good  eigenvalue  estimates  to  compute  Leja,  points  or  a  bounding  ellipse  for 
the  spectrum  of  A,  as  are  needed  for  the  Newton  and  Chebyshev  bases,  respectively.  If 
using  CA-BICGSTAB,  or  CA-BICG  with  Newton  and  Chebyshev,  our  results  indicate  that 
restarting  the  algorithm  is  sufficient  for  regaining  convergence. 

Another  potential  use  of  this  kernel  is  dynamically  changing  the  number  of  inner  loop 
iterations,  s,  based  on  the  condition  number  of  the  R  matrix.  This  prevents  the  method 
from  attempting  to  perform  s  iterations  of  the  Krylov  method  with  a  rank-deficient  basis 
matrix,  which  will  occur  if  the  basis  length  is  chosen  to  be  too  long  (especially  for  the 
monomial  basis).  If  the  user  does  not  supply  an  s  value,  or  is  unsure  of  the  best  s  to  use 
for  convergence,  this  added  TSQR  operation  can  allow  us  to  dynamically  change  the  s  value 
during  each  outer  loop  iteration  to  ensure  convergence.  After  the  TSQR  operation,  the  R 
factor  is  distributed  to  each  processor  (R  is  small,  0(s )  x  0(s)).  Each  processor  can  then, 
without  communication,  compute  how  many  steps  of  the  algorithm  it  is  safe  to  take  by 
incrementally  finding  the  rank  of  R  (or  incrementally  estimating  the  condition  number)  - 
when  the  rank  of  the  first  r  columns  is  less  than  r,  we  stop  and  set  s  —  r  —  1.  Therefore  we 
can  be  sure  that  our  basis  vectors  are  not  linearly  dependent,  and  convergence  will  still  be 
possible. 


5  Implementation  Details 

5.1  Matrix  Powers  Kernel  Variants  for  Multiple  RHSs  and  AH 

In  the  CA-CG  and  CA-GMRES  methods  studied  by  [20,  27],  only  one  matrix  powers  ker¬ 
nel  operation  is  needed  with  the  matrix  A.  In  contrast,  our  CA-BICG  method  requires  4 
matrix  powers  kernel  operations  -  two  vectors  with  A  and  two  vectors  with  AH  (only  one 
RHS  is  needed  for  the  3-term  recurrence  variant),  and  CA-BICGSTAB  requires  3  matrix 
powers  kernel  operations  -  three  different  vectors  with  A.  In  practice,  although  the  storage 
costs  increase  by  a  constant,  the  communication  cost  does  not  increase.  We  can  perform  the 
matrix  powers  kernel  operation  with  multiple  RHSs  still  reading  the  matrix  A  only  once. 
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Here,  we  perform  s  Sparse  Matrix-Matrix  Multiplications  (SpMMs)  instead  of  s  SpMVs. 
Vuduc  [40]  has  implemented  an  optimized  version  of  the  SpMM  kernel  which  only  requires 
reading  A  once  to  operate  on  all  RHSs.  Such  an  implementation  of  the  matrix  powers 
kernel  (one  that  instead  computes  {B,  AB:  A2B ,  ASB}  )  would  also  allow  for  develop¬ 
ment  of  Communication- Avoiding  Block  Krylov  Methods  in  which  many  systems  are  solved 
simultaneously  with  the  same  matrix. 

We  can  also  perform  the  matrix  powers  kernel  operations  on  both  A  and  AT  reading  the 
matrix  A  only  once  (simply  by  switching  the  direction  of  the  arrows  in  the  layered  graph 
of  dependencies).  Because  of  these  further  reductions  in  required  communication  vs.  the 
standard  implementations,  we  expect  to  see  a  greater  performance  gain  for  IvSMs  which 
require  multiple  Krylov  subspaces  using  a  specialized  matrix  powers  kernel. 

Because  the  required  functionality  of  the  matrix  powers  kernel  varies  depending  on  which 
CA-KSM  is  used,  auto-tuning  and  code  generation  is  useful.  There  are  currently  two  active 
projects  dealing  with  this  problem.  One  approach  involves  incorporation  of  the  matrix  powers 
kernel  into  pOSKI,  an  auto-tuned  library  for  sparse  matrix  computations  [22],  Another 
involves  Selective  Embedded  Just-In-Time  Specialization  (SEJITS),  in  which  different  code 
variants  are  selected  dynamically  during  execution  based  on  the  run-time  parameters  [7]. 

5.2  Finding  Eigenvalue  Estimates 

In  our  numerical  experiments,  we  assume  full  knowledge  of  the  spectrum  (as  this  shows  the 
best  we  can  hope  to  do  using  these  bases),  but  in  practice  the  user  might  not  have  any 
information  about  the  eigenvalues  of  A.  In  this  case,  if  a  Newton  or  Chebyshev  basis  is  to 
be  used  in  the  computation,  we  must  first  find  eigenvalue  estimates  for  A. 

This  is  commonly  accomplished  by  taking  s  (or  O(s))  steps  of  Arnoldi  to  get  the  upper 
Hessenberg  matrix  H.  The  eigenvalues  of  H,  called  Ritz  values,  are  often  good  estimates  for 
the  eigenvalues  of  A. 

Using  the  Ritz  values,  we  can  then  either  computed  the  (modified)  Leja  points  for  use 
in  the  Newton  basis  or  find  good  bounding  ellipse  parameters  for  use  in  the  Chebyshev 
basis  [29].  It  should  be  noted  that,  to  use  Joubert  and  Carey’s  formula  for  the  Chebyshev 
basis,  we  really  only  need  the  maximum  and  minimum  real  eigenvalues  and  largest  imaginary 
eigenvalue  (which  define  the  ellipse).  Here,  instead  of  computing  s  eigenvalue  estimates,  we 
might  be  able  to  make  due  with  fewer  Arnoldi  iterations  designed  to  find  these  quantities. 

While  we  don’t  need  to  be  completely  accurate  in  these  estimates,  problems  can  occur  in 
extreme  cases,  depending  on  the  spectrum  of  A.  If  Arnoldi  gives  you  the  s  largest  eigenvalues 
and  these  are  close  together,  the  Newton  basis  will  be  ill-conditioned,  as  the  basis  vectors 
will  be  nearly  linearly  dependent.  Additionally,  an  ill-conditioned  change  of  basis  matrix  will 
result  if  the  chosen  shifts  vary  by  many  orders  of  magnitude  (here  the  ill-conditioning  of  the 
monomial  basis  has  just  been  transferred  to  the  change  of  basis  matrix).  It  is  also  known 
that  the  Chebyshev  basis  does  poorly  if  the  eigenvalues  are  clustered  inside  the  ellipse. 

5.3  Choice  of  Basis  in  Practice 

Based  on  observations  about  the  convergence  of  the  CA-IvSMs  studied  in  this  paper,  we  offer 
some  preliminary  guidance  in  choosing  a  basis  in  practical  execution  of  these  methods.  In 
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practice,  if  s  >  5  is  required,  either  Newton  or  Chebyshev  basis  should  be  used  if  the  user 
knows  (some  of)  the  eigenvalues  of  A  or  can  afford  to  compute  estimates,  as  the  monomial 
basis  rarely  convergences  in  this  case  (unless  the  matrix  is  extremely  well  conditioned). 
Even  though  some  runtime  may  be  lost  computing  eigenvalue  estimates  and  performing 
more  floating  point  operations  in  recovering  the  monomial  basis,  the  ability  to  choose  a 
larger  value  of  s  (assuming  As  is  still  well-partitioned)  may  still  result  in  a  net  performance 
gain. 

Another  option  for  higher  s  values  is  to  use  the  monomial  basis  with  orthogonalization 
(by  performing  a  TSQR  operation  on  the  basis  vectors,  described  in  section  4.2.4),  although 
this  requires  an  additional  communication  in  the  outer  loop. 

If  we  can  only  use  a  small  value  of  s  (based  on  the  structure  and  density  of  As)  or  if  we 
have  a  very  well-conditioned  matrix,  the  monomial  basis  will  usually  perform  on  par  with 
the  Chebyshev  and  Newton  bases.  In  this  case,  the  monomial  basis  is  the  obvious  choice, 
since  fewer  floating  point  operations  are  required  in  the  CA-KSM. 

In  future  implementations  of  these  CA-KSMs,  the  user  will  not  be  burdened  with  making 
these  decisions.  Ideally,  an  auto-tuner  could  be  used  to  select  an  appropriate  code  variant 
that  achieves  both  performance  and  stability  based  on  matrix  structure,  eigenvalue  estimates, 
and  other  properties  of  A.  The  study  and  development  of  heuristics  to  accomplish  such  a 
task  is  future  work. 


6  Future  Work  and  Conclusions 

6.1  Conclusions  and  Comments  on  Convergence 

In  this  work,  we  have  developed  three  new  CA-KSMs:  CA-BICG,  CA-CGS,  and  CA- 
BICGSTAB.  We  have  implemented  the  monomial,  Newton  and  Chebyshev  bases  and  studied 
convergence  properties  for  a  variety  of  matrices.  We  are  able  to  use  basis  length  2  <  s  <  20, 
depending  on  the  matrix  properties,  where  using  a  basis  length  of  s  allows  for  up  to  an  sx 
speedup. 

Convergence  results  indicate  that,  like  standard  implementations,  our  methods  suffer 
from  round  off  error.  The  higher  the  s  value  used,  the  more  prominent  this  effect,  which 
decreases  maximum  attainable  accuracy.  To  combat  this  behavior,  we  have  explored  restart¬ 
ing  in  CA-KSMs  as  a  method  to  reduce  stagnation  due  to  deviation  of  true  and  computed 
residuals,  with  promising  results.  We  have  also  discussed  the  use  of  using  the  TSQR  kernel 
to  orthogonalize  the  basis  and  dynamically  select  s,  although  whether  or  not  this  is  practical 
is  yet  to  be  determined. 

Based  on  our  initial  observations,  we  have  provided  details  and  advice  for  the  practical 
implementation  of  CA-KSMs,  including  necessary  preprocessing  steps  such  as  selecting  a 
polynomial  basis  for  stability  purposes.  Additionally,  we  have  shown  how  to  further  reduce 
communication  by  using  a  variant  of  the  Akx  kernel  that  does  s  SpMMs  instead  of  s  SpMVs 
if  we  have  a  method  that  requires  more  than  one  Krylov  Subspace.  This  kernel  can  also  be 
used  to  do  CA-Block  IvSM  versions  of  our  methods,  to  solve  multiple  systems  simultaneously. 
Decisions  such  as  these  will  eventually  be  handled  by  a  combination  of  auto-tuning  and  code 
generation. 
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6.2  Future  Work 


Although  initial  convergence  results  are  promising,  there  is  much  remaining  work  involved  in 
understanding,  analyzing  and  improving  convergence,  as  well  as  in  creating  efficient  imple¬ 
mentations  of  these  methods,  benchmarking  performance,  and  developing  auto-tuner  heuris¬ 
tics.  Making  CA-IvSMs  practical,  efficient,  and  available  for  use  will  require  much  time  and 
effort.  The  three  areas  of  future  work  listed  below  are  considered  to  be  priorities  in  order  to 
meet  this  goal. 

Performance  Experiments  Our  next  step  will  be  to  code  parallel,  optimized  implemen¬ 
tations  (both  sequential  and  parallel)  of  our  CA-IvSMs  and  required  kernels.  We  will  then 
benchmark  performance  for  a  variety  of  matrices,  on  a  variety  of  platforms.  Although  we 
have  shown  that  communication  is  theoretically  asymptotically  reduced  in  our  algorithms, 
we  must  also  see  how  our  algorithms  behave  in  practice  (how  much  redundant  work  is  re¬ 
quired,  where  the  optimal  s  value  is  depending  on  matrix  surface-to-volume  ratio,  etc).  Such 
an  analysis  will  determine  the  most  important  areas  to  work  on  moving  forward. 

Auto-tuning  and  Code  Generation  for  Matrix  Powers  Many  complex  optimizations 
and  implementation  decisions  are  required  for  both  matrix  powers  kernel  performance  and 
convergence  of  CA-IvSMs.  These  optimizations  are  both  machine  and  matrix  dependent, 
which  makes  auto-tuning  and  code  generation  an  attractive  approach.  Employing  such 
techniques  will  lift  the  burden  of  expertise  from  the  user,  making  the  use  of  CA-IvSMs  more 
accessible.  Many  approaches  and  ongoing  projects  exist  for  this  purpose  [7,  22,  24],  We  also 
plan  to  apply  our  methods  to  applications  (e.g.,  as  a  bottom-solver  in  multigrid  methods),  in 
conjunction  with  collaborators  at  the  Communication  Avoidance  and  Communication  Hiding 
at  the  Extreme  Scale  (CACHE)  Institute,  an  interdisciplinary  project  funded  by  the  U.S. 
Department  of  Energy.  We  hope  that  by  further  study  of  the  convergence  and  performance 
properties  of  our  CA-IvSMs,  we  can  contribute  to  the  design  of  auto-tuners  and  heuristics, 
and  the  successful  integration  of  our  work  into  existing  frameworks. 

Techniques  to  Further  Improve  Stability  In  this  work,  we  presented  a  qualitative 
analysis  of  initial  convergence  results  for  a  small  set  of  test  matrices.  A  more  rigorous 
analysis  of  convergence  properties  for  CA-IvSMs  will  be  necessary,  and  helpful  in  determining 
new  methods  for  improving  stability.  Current  ideas  for  improving  stability  include  using 
techniques  such  as  dispersive  Lanczos  [36]  for  finding  better  eigenvalue  estimates,  using 
extended  precision  in  (perhaps  only  part  of)  the  CA-IvSM  iterations,  and  further  exploring 
the  use  of  variable  basis  length,  incremental  condition  estimation,  and  orthogonalization 
techniques. 
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